---
license: apache-2.0
language:
- en
tags:
- music
- text-generation
- transformers
pipeline_tag: text-generation
library_name: transformers
---

# Stage 2 Model

# ScrapeGoatMusic Generation API

A music generation system powered by ScrapeGoatMusic. It is optimized for NVIDIA H100 GPUs and exposed through a FastAPI service.

## System Requirements

- NVIDIA H100 GPU
- CUDA 12.0 or higher
- Python 3.8
- 32GB+ RAM
- Ubuntu 22.04 LTS or later

## Installation

1. Create and activate a conda environment:
```bash
conda create -n ScrapeGoatMusic python=3.8
conda activate ScrapeGoatMusic
```

2. Install PyTorch with CUDA support:
```bash
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
```

3. Install dependencies:
```bash
pip install descript-audio-codec
pip install npy_append_array soundfile
pip install fastapi uvicorn python-multipart
pip install flash-attn --no-build-isolation
```

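4. Download the required model files (run from the repository root):
```bash
# Download models from Hugging Face
git lfs install
cd inference
git clone https://huggingface.co/Nathan9/xcodec_mini_infer
```

5. Clone and install RepCodec inside the downloaded `xcodec_mini_infer` directory (run from the repository root; the directory must exist before this step):
```bash
cd inference/xcodec_mini_infer
git clone https://github.com/mct10/RepCodec.git
cd RepCodec
pip install .
```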

## API Setup

1. Create a new file `api.py`:
```python
from fastapi import FastAPI, UploadFile, File, Form
from fastapi.responses import FileResponse
import uvicorn
import torch
import os
import argparse
from pathlib import Path
import uuid
from typing import Optional

app = FastAPI(title="ScrapeGoatMusic Generation API")

# Initialize models and configurations
def init_models():
    parser = argparse.ArgumentParser()
    # Add all your existing arguments here
    args = parser.parse_args([])
    args.stage1_model = "Nathan9/ScrapeGoatMusic-s1-7B-anneal-en-cot"
    args.stage2_model = "Nathan9/ScrapeGoatMusic-s2-1B-general"
    args.max_new_tokens = 3000
    args.run_n_segments = 2
    args.stage2_batch_size = 4
    args.output_dir = "./output"
    args.cuda_idx = 0
    # Add other default arguments
    return args

@app.on_event("startup")
async def startup_event():
    global args
    args = init_models()
    os.makedirs(args.output_dir, exist_ok=True)

@app.post("/generate")
async def generate_music(
    genre_file: UploadFile = File(...),
    lyrics_file: UploadFile = File(...),
    audio_prompt: Optional[UploadFile] = File(None),
    prompt_start_time: float = Form(0.0),
    prompt_end_time: float = Form(30.0)
):
    # Create unique session ID
    session_id = str(uuid.uuid4())
    session_dir = Path(args.output_dir) / session_id
    os.makedirs(session_dir, exist_ok=True)

    # Save uploaded files
    genre_path = session_dir / "genre.txt"
    lyrics_path = session_dir / "lyrics.txt"
    
    with open(genre_path, "wb") as f:
        f.write(await genre_file.read())
    with open(lyrics_path, "wb") as f:
        f.write(await lyrics_file.read())

    # Handle optional audio prompt
    audio_prompt_path = None
    if audio_prompt:
        audio_prompt_path = session_dir / "audio_prompt.wav"
        with open(audio_prompt_path, "wb") as f:
            f.write(await audio_prompt.read())

    # Run inference
    try:
        # Import your inference code here
        from infer import run_inference
        output_path = run_inference(
            args,
            str(genre_path),
            str(lyrics_path),
            str(audio_prompt_path) if audio_prompt_path else None,
            prompt_start_time,
            prompt_end_time
        )
        
        return FileResponse(
            output_path,
            media_type="audio/mpeg",
            filename=f"generated_music_{session_id}.mp3"
        )
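    except Exception as e:
        # Report failures with HTTP 500 rather than a 200 response carrying an error body
        from fastapi.responses import JSONResponse
        return JSONResponse(status_code=500, content={"error": str(e)})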

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
```

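2. Create a new file `infer.py` containing your existing inference code, restructured as an importable module. It must expose `run_inference(args, genre_path, lyrics_path, audio_prompt_path, prompt_start_time, prompt_end_time)` and return the path of the generated audio file, because `api.py` passes that return value to `FileResponse`.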
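
As a starting point, the module could look like the skeleton below. Only the `run_inference` call shape is taken from `api.py`; `_generate` and the `output.mp3` file name are hypothetical placeholders for your own pipeline, so treat this as a sketch rather than the actual inference code.

```python
# infer.py (skeleton). Only the run_inference call shape comes from api.py;
# _generate and the output file name are hypothetical placeholders.
from pathlib import Path
from typing import Optional


def _generate(args, genres: str, lyrics: str, audio_prompt_path: Optional[str],
              prompt_start_time: float, prompt_end_time: float,
              output_path: Path) -> None:
    """Placeholder: run stage 1 and stage 2, then write audio to output_path."""
    raise NotImplementedError("plug your existing inference code in here")


def run_inference(args, genre_path: str, lyrics_path: str,
                  audio_prompt_path: Optional[str] = None,
                  prompt_start_time: float = 0.0,
                  prompt_end_time: float = 30.0) -> str:
    """Read the uploaded inputs, run generation, and return the output path."""
    genres = Path(genre_path).read_text(encoding="utf-8").strip()
    lyrics = Path(lyrics_path).read_text(encoding="utf-8").strip()
    if not genres or not lyrics:
        raise ValueError("genre and lyrics files must not be empty")

    # Write next to the uploaded files, i.e. inside the per-request session directory.
    output_path = Path(genre_path).parent / "output.mp3"
    _generate(args, genres, lyrics, audio_prompt_path,
              prompt_start_time, prompt_end_time, output_path)
    return str(output_path)
```

Keeping file reading and validation in `run_inference` means the API handler stays thin and the same function can be called from a script or a test.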

## Running the API

1. Start the API server:
```bash
python api.py
```

2. The server listens on all interfaces (`0.0.0.0`) on port 8000, so it is reachable locally at `http://localhost:8000`.
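
The `/generate` handler is declared `async`, but the inference call inside it is blocking, so the event loop cannot serve other requests while a song is being generated. One common pattern, sketched here with standard-library `asyncio` only, is to run the blocking call on a worker thread and guard it with a lock so that a single generation uses the GPU at a time. `blocking_inference` is a stand-in for `run_inference`, and none of these names come from the original code.

```python
import asyncio
import time
from functools import partial
from typing import List


def blocking_inference(job_id: int) -> str:
    """Stand-in for run_inference: blocks the calling thread, then returns a path."""
    time.sleep(0.05)
    return f"output_{job_id}.mp3"


async def generate(job_id: int, gpu_lock: asyncio.Lock) -> str:
    loop = asyncio.get_running_loop()
    # Only one job may hold the GPU; the others wait here without blocking the loop.
    async with gpu_lock:
        # Run the blocking call on the default thread pool.
        return await loop.run_in_executor(None, partial(blocking_inference, job_id))


async def main() -> List[str]:
    gpu_lock = asyncio.Lock()  # created inside the running loop
    # gather returns results in the order the jobs were submitted
    return await asyncio.gather(*(generate(i, gpu_lock) for i in range(3)))


if __name__ == "__main__":
    print(asyncio.run(main()))
```

In the API itself, the lock would be created once in the startup handler and shared by all requests.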

## API Endpoints

### POST /generate
Generates music based on provided genre and lyrics.

**Parameters:**
- `genre_file`: Text file containing genre tags (Required)
- `lyrics_file`: Text file containing lyrics (Required)
- `audio_prompt`: Audio file for prompt (Optional)
- `prompt_start_time`: Start time for audio prompt (Default: 0.0)
- `prompt_end_time`: End time for audio prompt (Default: 30.0)
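
The handler accepts any two floats for the prompt window. A small check like the sketch below could reject a negative or inverted window before inference starts; `validate_prompt_window` is a hypothetical helper, not part of the original code.

```python
def validate_prompt_window(start: float, end: float) -> float:
    """Return the window length in seconds, or raise ValueError if it is invalid."""
    if start < 0:
        raise ValueError(f"prompt_start_time must be >= 0, got {start}")
    if end <= start:
        raise ValueError(
            f"prompt_end_time ({end}) must be greater than prompt_start_time ({start})"
        )
    return end - start


if __name__ == "__main__":
    # The defaults from the endpoint signature describe a 30-second window.
    print(validate_prompt_window(0.0, 30.0))
```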

**Example using curl:**
```bash
# -F sends multipart/form-data (curl sets the header and boundary itself);
# --output saves the returned audio instead of printing binary data to the terminal
curl -X POST "http://localhost:8000/generate" \
  -F "genre_file=@/path/to/genre.txt" \
  -F "lyrics_file=@/path/to/lyrics.txt" \
  -F "prompt_start_time=0.0" \
  -F "prompt_end_time=30.0" \
  --output generated_music.mp3
```
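
The same request can be sent from Python. The sketch below uses only the standard library and builds the multipart body by hand; `encode_multipart` and `generate` are illustrative names, and a library such as `requests` would shorten this considerably if you have it installed.

```python
import urllib.request
import uuid
from pathlib import Path
from typing import Dict, Tuple


def encode_multipart(fields: Dict[str, str],
                     files: Dict[str, Tuple[str, bytes]]) -> Tuple[bytes, str]:
    """Return (body, content_type) for a multipart/form-data request."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f'{value}\r\n'.encode()
        )
    for name, (filename, content) in files.items():
        header = (
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"; filename="{filename}"\r\n'
            'Content-Type: application/octet-stream\r\n\r\n'
        )
        parts.append(header.encode() + content + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def generate(genre_path: str, lyrics_path: str, out_path: str,
             url: str = "http://localhost:8000/generate") -> None:
    """POST the genre and lyrics files and save the returned audio to out_path."""
    body, content_type = encode_multipart(
        {"prompt_start_time": "0.0", "prompt_end_time": "30.0"},
        {
            "genre_file": ("genre.txt", Path(genre_path).read_bytes()),
            "lyrics_file": ("lyrics.txt", Path(lyrics_path).read_bytes()),
        },
    )
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        payload = response.read()
        # The endpoint declares audio/mpeg on success; anything else is an error body.
        if response.headers.get_content_type() != "audio/mpeg":
            raise RuntimeError(payload.decode(errors="replace"))
    Path(out_path).write_bytes(payload)
```

Calling `generate("genre.txt", "lyrics.txt", "generated_music.mp3")` with the server running writes the result to disk.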

**Example genre.txt format:**
```
instrumental pop energetic female vocals
```

**Example lyrics.txt format:**
```
[verse]
Your lyrics here
[chorus]
Your chorus here
```
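
To check a lyrics file before sending it, the format above can be split into tagged sections. This is a minimal sketch that assumes every section starts with a tag on its own line, as in the example; `split_lyrics` is a hypothetical helper, not part of the model code.

```python
import re
from typing import List, Tuple

# A section tag is a line consisting only of [name], e.g. [verse] or [chorus].
SECTION_TAG = re.compile(r"^\[([^\[\]]+)\]$")


def split_lyrics(text: str) -> List[Tuple[str, str]]:
    """Split lyrics into (tag, body) pairs in the order they appear."""
    sections: List[Tuple[str, str]] = []
    tag = None
    lines: List[str] = []
    for line in text.splitlines():
        match = SECTION_TAG.match(line.strip())
        if match:
            if tag is not None:
                sections.append((tag, "\n".join(lines).strip()))
            tag, lines = match.group(1).lower(), []
        elif tag is not None:
            lines.append(line)
        elif line.strip():
            raise ValueError(f"text before the first section tag: {line!r}")
    if tag is not None:
        sections.append((tag, "\n".join(lines).strip()))
    return sections


if __name__ == "__main__":
    example = "[verse]\nYour lyrics here\n[chorus]\nYour chorus here\n"
    for section_tag, section_body in split_lyrics(example):
        print(section_tag, "->", section_body)
```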

## H100 Optimization

1. Enable Flash Attention:
```python
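import torch
from transformers import AutoModelForCausalLM

stage1_model = "Nathan9/ScrapeGoatMusic-s1-7B-anneal-en-cot"
# bfloat16 weights with FlashAttention-2 kernels (needs the flash-attn package)
model = AutoModelForCausalLM.from_pretrained(
    stage1_model,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)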
```

2. Select the GPU and enable cuDNN autotuning:
```python
import torch

# Add to your inference configuration
torch.cuda.set_device(0)  # Use the first H100
torch.backends.cudnn.benchmark = True  # Autotune cuDNN convolution kernels
```

3. On a multi-GPU machine, change `cuda_idx` in the API configuration (`init_models()`) to select which GPU the server uses.

## Monitoring

The API includes Swagger documentation at `http://localhost:8000/docs` for testing and monitoring endpoints.

## Troubleshooting

1. CUDA Out of Memory:
- Reduce `stage2_batch_size`
- Lower `max_new_tokens`
- When fine-tuning, enable gradient checkpointing (it does not reduce memory use during inference)

2. Audio Quality Issues:
- Check input audio format (16kHz, mono)
- Verify genre tags format
- Ensure lyrics follow the correct structure
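
The audio format check can be automated with Python's built-in `wave` module. This is a sketch for uncompressed PCM WAV files only (`wave` cannot read other encodings), and `check_wav` is a hypothetical helper name.

```python
import wave
from typing import List


def check_wav(path: str, expected_rate: int = 16000,
              expected_channels: int = 1) -> List[str]:
    """Return a list of format problems; an empty list means the file matches."""
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
    if rate != expected_rate:
        problems.append(f"sample rate is {rate} Hz, expected {expected_rate} Hz")
    if channels != expected_channels:
        problems.append(f"{channels} channel(s), expected {expected_channels}")
    return problems
```

For example, `check_wav("audio_prompt.wav")` returns an empty list when the prompt is 16 kHz mono, and a description of each mismatch otherwise.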

## Training

This model was created through a multi-stage training process optimized for music generation. You can further fine-tune the model on your own data using the following steps:

### Data Preparation

1. Prepare your training data using the provided script:
```bash
python prepare_training_data.py
```

The script expects the following directory structure:
```
training_data/
├── audio_tracks/   # 16kHz mono WAV files
├── lyrics/         # Corresponding lyrics files
└── genres/         # Genre tag files
```
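
Before running the preparation script, it can help to confirm that every track has all three files. The sketch below assumes files are matched by base name and use `.wav` for audio and `.txt` for lyrics and genres; the source does not state the naming scheme, so adjust the patterns to your data. `find_incomplete_samples` is a hypothetical helper.

```python
from pathlib import Path
from typing import Dict, List


def find_incomplete_samples(root: str) -> Dict[str, List[str]]:
    """Map each sample name to the subdirectories where its file is missing."""
    base = Path(root)
    # Assumed naming scheme: <name>.wav, <name>.txt, <name>.txt
    present = {
        "audio_tracks": {p.stem for p in (base / "audio_tracks").glob("*.wav")},
        "lyrics": {p.stem for p in (base / "lyrics").glob("*.txt")},
        "genres": {p.stem for p in (base / "genres").glob("*.txt")},
    }
    all_names = set().union(*present.values())
    missing: Dict[str, List[str]] = {}
    for name in sorted(all_names):
        gaps = [folder for folder, names in present.items() if name not in names]
        if gaps:
            missing[name] = gaps
    return missing


if __name__ == "__main__":
    for sample, folders in find_incomplete_samples("training_data").items():
        print(f"{sample}: missing in {', '.join(folders)}")
```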

### Training Requirements

- NVIDIA H100 GPU (recommended)
- 32GB+ GPU memory
- Training dataset with:
  - High-quality audio files (16kHz mono)
  - Aligned lyrics in structured format
  - Genre annotations
  - At least 10,000 samples recommended

### Fine-tuning Steps

1. Install additional training dependencies:
```bash
pip install accelerate datasets transformers
```

2. Prepare your configuration:
```bash
# Set ONE of the two pairs below; running both leaves only the second pair in effect.

# Option A: Stage 1 model (7B)
export MODEL_PATH="Nathan9/ScrapeGoatMusic-s1-7B-anneal-en-cot"
export OUTPUT_DIR="./fine_tuned_model_s1"

# Option B: Stage 2 model (1B)
export MODEL_PATH="Nathan9/ScrapeGoatMusic-s2-1B-general"
export OUTPUT_DIR="./fine_tuned_model_s2"
```

3. Start training:
```bash
python train.py \
    --model_name_or_path $MODEL_PATH \
    --output_dir $OUTPUT_DIR \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --learning_rate 1e-5 \
    --warmup_steps 500 \
    --logging_steps 100 \
    --save_steps 1000 \
    --evaluation_strategy steps \
    --load_best_model_at_end \
    --gradient_checkpointing true
```
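
To see how these flags interact, it helps to work out the schedule they imply. The arithmetic below assumes a single GPU, the 10,000-sample minimum recommended above, and that `train.py` counts optimizer steps the way a Hugging Face `Trainer`-style script would; all three are assumptions, since `train.py` is not shown here.

```python
import math
from typing import Dict


def training_schedule(num_samples: int, per_device_batch_size: int,
                      gradient_accumulation_steps: int, num_gpus: int,
                      num_epochs: int) -> Dict[str, int]:
    """Estimate optimizer steps from dataset size and batch settings."""
    effective_batch = per_device_batch_size * gradient_accumulation_steps * num_gpus
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    return {
        "effective_batch_size": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": steps_per_epoch * num_epochs,
    }


if __name__ == "__main__":
    # Values from the command above, with one GPU and 10,000 samples assumed.
    print(training_schedule(10_000, 4, 4, 1, 3))
```

Under these assumptions the run has an effective batch size of 16, 625 steps per epoch, and 1,875 steps in total. The 500 warmup steps would then cover roughly a quarter of training, and `--save_steps 1000` would write only one intermediate checkpoint, so both may be worth scaling to your dataset size.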

### Training Tips

1. Stage 1 Model:
- Use larger batch sizes (8-16) for better convergence
- Enable gradient checkpointing for memory efficiency
- Start with a lower learning rate (1e-5)
- Train for at least 3 epochs

2. Stage 2 Model:
- Use smaller batch sizes (4-8)
- Higher learning rate possible (2e-5)
- Shorter training time needed
- Focus on audio quality metrics

3. Monitoring:
- Use Weights & Biases for training visualization
- Monitor loss curves for convergence
- Validate generation quality periodically
- Check for overfitting on the validation set

4. Performance Optimization:
- Enable Flash Attention during training
- Use mixed precision training (bf16)
- Distribute training across multiple GPUs if available
- Implement proper gradient clipping

## License

Released under the Apache 2.0 license, as declared in the model card metadata. Full access, enjoy.