File size: 1,479 Bytes
4a5b0ec
 
 
 
 
 
a0dc44c
4a5b0ec
 
 
 
 
 
 
 
 
 
 
 
67a3b3a
 
 
4a5b0ec
 
 
 
 
67a3b3a
 
4a5b0ec
 
 
67a3b3a
4a5b0ec
 
 
 
67a3b3a
 
 
4a5b0ec
 
 
 
 
 
 
 
67a3b3a
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
9512f11
 
a0dc44c
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
---
title: Vibesmark Test Suite
emoji: 🎯
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.13.1
app_file: app.py
pinned: false
---

# Vibesmark Test Suite

A benchmarking tool for comparing different language models side by side. This application allows users to:

- Upload custom test questions
- Compare responses from different language models
- Record preferences between model outputs
- Generate summary statistics of model performance

## Setup

1. Create a `.env` file with your OpenRouter API credentials:
```
OPENROUTER_API_KEY=your_api_key_here
OPENROUTER_BASE_URL=https://openrouter.ai/api/v1/chat/completions
```

2. Install dependencies:
```bash
pip install -r requirements.txt
```

3. Run the application:
```bash
python app.py
```

## Usage

1. Select two models to compare
2. Upload a text file containing test questions (one per line)
3. Start the test and evaluate responses
4. View results summary when finished

## Deployment

This app is ready to deploy on Hugging Face Spaces. Just add your OpenRouter API credentials as secrets in your Space settings.

## Features

- Compare responses from different AI models side by side
- Supports up to 10 questions per benchmark
- Randomly selects different models for comparison
- Real-time response generation

## Supported Models

- Claude 3 Opus
- Claude 3 Sonnet
- Gemini Pro
- Mistral Medium
- Claude 2.1
- GPT-4 Turbo
- GPT-3.5 Turbo

## License

[Your chosen license]

Run it with 
`python app.py`