ScrapeGoatMusic Generation API
A music generation system powered by ScrapeGoatMusic, optimized for NVIDIA H100 GPUs with FastAPI integration.
System Requirements
- NVIDIA H100 GPU
- CUDA 12.0 or higher
- Python 3.8
- 32GB+ RAM
- Ubuntu 22.04 LTS or higher
Installation
- Create and activate a conda environment:
```bash
conda create -n ScrapeGoatMusic python=3.8
conda activate ScrapeGoatMusic
```
- Install PyTorch with CUDA support:
```bash
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
```
- Install dependencies:
```bash
pip install descript-audio-codec
pip install npy_append_array soundfile
pip install fastapi uvicorn python-multipart
pip install flash-attn --no-build-isolation
```
- Clone and install RepCodec:
```bash
cd inference/xcodec_mini_infer
git clone https://github.com/mct10/RepCodec.git
cd RepCodec
pip install .
```
- Download required model files:
```bash
# Download models from Hugging Face
git lfs install
cd inference
git clone https://huggingface.co/scrapegoat/Neural-Audio-Codec
```
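After installation, a quick sanity check of the GPU stack can save debugging time later. The snippet below is optional and only assumes the PyTorch install from above:

```python
# Optional sanity check for the CUDA / bf16 setup installed above
import torch

print(torch.__version__, torch.version.cuda)   # PyTorch and CUDA versions
print(torch.cuda.get_device_name(0))           # should report an NVIDIA H100
print(torch.cuda.is_bf16_supported())          # True on H100 (needed for bf16 inference)
```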
API Setup
- Create a new file `api.py`:
```python
from fastapi import FastAPI, UploadFile, File, Form
from fastapi.responses import FileResponse
import uvicorn
import torch
import os
import argparse
from pathlib import Path
import uuid
from typing import Optional

app = FastAPI(title="ScrapeGoatMusic Generation API")

# Initialize models and configurations
def init_models():
    parser = argparse.ArgumentParser()
    # Add all your existing arguments here
    args = parser.parse_args([])
    args.stage1_model = "scrapegoat/ScrapeGoat-Music-Stage1"
    args.stage2_model = "scrapegoat/ScrapeGoat-Music-Stage2"
    args.max_new_tokens = 3000
    args.run_n_segments = 2
    args.stage2_batch_size = 4
    args.output_dir = "./output"
    args.cuda_idx = 0
    # Add other default arguments
    return args

@app.on_event("startup")
async def startup_event():
    global args
    args = init_models()
    os.makedirs(args.output_dir, exist_ok=True)

@app.post("/generate")
async def generate_music(
    genre_file: UploadFile = File(...),
    lyrics_file: UploadFile = File(...),
    audio_prompt: Optional[UploadFile] = File(None),
    prompt_start_time: float = Form(0.0),
    prompt_end_time: float = Form(30.0)
):
    # Create a unique session ID so concurrent requests do not collide
    session_id = str(uuid.uuid4())
    session_dir = Path(args.output_dir) / session_id
    os.makedirs(session_dir, exist_ok=True)

    # Save uploaded files
    genre_path = session_dir / "genre.txt"
    lyrics_path = session_dir / "lyrics.txt"
    with open(genre_path, "wb") as f:
        f.write(await genre_file.read())
    with open(lyrics_path, "wb") as f:
        f.write(await lyrics_file.read())

    # Handle optional audio prompt
    audio_prompt_path = None
    if audio_prompt:
        audio_prompt_path = session_dir / "audio_prompt.wav"
        with open(audio_prompt_path, "wb") as f:
            f.write(await audio_prompt.read())

    # Run inference
    try:
        # Import your inference code here
        from infer import run_inference
        output_path = run_inference(
            args,
            str(genre_path),
            str(lyrics_path),
            str(audio_prompt_path) if audio_prompt_path else None,
            prompt_start_time,
            prompt_end_time
        )
        return FileResponse(
            output_path,
            media_type="audio/mpeg",
            filename=f"generated_music_{session_id}.mp3"
        )
    except Exception as e:
        return {"error": str(e)}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
```
- Create a new file `infer.py` containing your existing inference code, modified so it can be imported as a module.
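`api.py` above assumes `infer.py` exposes a `run_inference` function with the call signature used in the `/generate` handler. The skeleton below only illustrates that interface; the body is a placeholder for your existing inference pipeline:

```python
# infer.py -- illustrative skeleton only; replace the body with your existing
# Stage 1 / Stage 2 inference code.
from pathlib import Path


def run_inference(args, genre_path, lyrics_path, audio_prompt_path=None,
                  prompt_start_time=0.0, prompt_end_time=30.0):
    """Generate a track and return the path of the rendered MP3 file."""
    output_dir = Path(args.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    # 1. Load the Stage 1 / Stage 2 models (args.stage1_model, args.stage2_model)
    # 2. Read genre tags from genre_path and lyrics from lyrics_path
    # 3. Optionally condition on the [prompt_start_time, prompt_end_time] slice
    #    of the audio at audio_prompt_path
    # 4. Run generation and write the result under output_dir

    output_path = output_dir / "generated.mp3"
    return str(output_path)
```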
Running the API
- Start the API server:
```bash
python api.py
```
- The API will be available at `http://localhost:8000`.
API Endpoints
POST /generate
Generates music based on provided genre and lyrics.
Parameters:
- `genre_file`: Text file containing genre tags (required)
- `lyrics_file`: Text file containing lyrics (required)
- `audio_prompt`: Audio file to use as a prompt (optional)
- `prompt_start_time`: Start time of the audio prompt in seconds (default: 0.0)
- `prompt_end_time`: End time of the audio prompt in seconds (default: 30.0)
Example using curl:
```bash
curl -X POST "http://localhost:8000/generate" \
  -H "accept: application/json" \
  -H "Content-Type: multipart/form-data" \
  -F "genre_file=@/path/to/genre.txt" \
  -F "lyrics_file=@/path/to/lyrics.txt" \
  -F "prompt_start_time=0.0" \
  -F "prompt_end_time=30.0"
```
Example genre.txt format:
```
instrumental pop energetic female vocals
```
Example lyrics.txt format:
```
[verse]
Your lyrics here
[chorus]
Your chorus here
```
H100 Optimization
- Enable Flash Attention:
```python
from transformers import AutoModelForCausalLM
import torch

model = AutoModelForCausalLM.from_pretrained(
    stage1_model,  # e.g. args.stage1_model
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2"
)
```
- Optimize memory usage:
```python
# Add to your inference configuration
torch.cuda.set_device(0)  # Use the first H100
torch.backends.cudnn.benchmark = True
```
- For a multi-GPU setup, modify `cuda_idx` in the API configuration (a placement sketch follows below).
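For reference, a minimal sketch of how `cuda_idx` might be applied when placing the model; the `args` object and `model` variable are assumed to come from `api.py` and your inference code:

```python
# Sketch: place the model on the GPU selected by args.cuda_idx
import torch

device = torch.device(f"cuda:{args.cuda_idx}" if torch.cuda.is_available() else "cpu")
model = model.to(device)
```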
Monitoring
The API serves interactive Swagger documentation at `http://localhost:8000/docs`, which can be used to test and inspect the endpoints.
Troubleshooting
- CUDA Out of Memory:
  - Reduce `stage2_batch_size`
  - Lower `max_new_tokens`
  - Use gradient checkpointing
- Audio Quality Issues:
  - Check the input audio format (16 kHz, mono); see the snippet below for a quick check
  - Verify the genre tags format
  - Ensure lyrics follow the correct structure
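For the audio format check, a small snippet like the following can be used. It relies only on torchaudio, which is installed alongside PyTorch above; the file names are placeholders:

```python
# Convert an audio prompt to 16 kHz mono before uploading it
import torchaudio

waveform, sr = torchaudio.load("audio_prompt.wav")
if waveform.shape[0] > 1:                        # downmix stereo -> mono
    waveform = waveform.mean(dim=0, keepdim=True)
if sr != 16000:                                  # resample to 16 kHz
    waveform = torchaudio.transforms.Resample(sr, 16000)(waveform)
torchaudio.save("audio_prompt_16k.wav", waveform, 16000)
```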
Training
This model was created through a multi-stage training process optimized for music generation. You can further fine-tune the model on your own data using the following steps:
Data Preparation
- Prepare your training data using the provided script:
```bash
python prepare_training_data.py
```
The script expects the following directory structure:
```
training_data/
├── audio_tracks/   # 16kHz mono WAV files
├── lyrics/         # Corresponding lyrics files
└── genres/         # Genre tag files
```
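Before running the script, it is worth confirming that every audio track has matching lyrics and genre files. The sketch below assumes files are matched by stem (e.g. `track001.wav` ↔ `track001.txt`), which may differ from your naming scheme:

```python
# Sketch: verify that each WAV file has corresponding lyrics and genre files
from pathlib import Path

root = Path("training_data")
for wav in sorted((root / "audio_tracks").glob("*.wav")):
    for sub in ("lyrics", "genres"):
        if not (root / sub / f"{wav.stem}.txt").exists():
            print(f"Missing {sub} file for {wav.name}")
```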
Training Requirements
- NVIDIA H100 GPU (recommended)
- 32GB+ GPU memory
- Training dataset with:
- High-quality audio files (16kHz mono)
- Aligned lyrics in structured format
- Genre annotations
- At least 10,000 samples recommended
Fine-tuning Steps
- Install additional training dependencies:
```bash
pip install accelerate datasets transformers
```
- Prepare your configuration:
```bash
# For the Stage 1 model (7B)
export MODEL_PATH="scrapegoat/ScrapeGoat-Music-Stage1"
export OUTPUT_DIR="./fine_tuned_model_s1"

# For the Stage 2 model (1B)
export MODEL_PATH="scrapegoat/ScrapeGoat-Music-Stage2"
export OUTPUT_DIR="./fine_tuned_model_s2"
```
- Start training:
```bash
python train.py \
    --model_name_or_path $MODEL_PATH \
    --output_dir $OUTPUT_DIR \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --learning_rate 1e-5 \
    --warmup_steps 500 \
    --logging_steps 100 \
    --save_steps 1000 \
    --evaluation_strategy steps \
    --load_best_model_at_end \
    --gradient_checkpointing true
```
Training Tips
- Stage 1 Model:
  - Use larger batch sizes (8-16) for better convergence
  - Enable gradient checkpointing for memory efficiency
  - Start with a lower learning rate (1e-5)
  - Train for at least 3 epochs
- Stage 2 Model:
  - Use smaller batch sizes (4-8)
  - A higher learning rate (2e-5) is possible
  - Shorter training time is needed
  - Focus on audio quality metrics
- Monitoring:
  - Use Weights & Biases for training visualization
  - Monitor loss curves for convergence
  - Validate generation quality periodically
  - Check for overfitting on the validation set
- Performance Optimization (see the configuration sketch below):
  - Enable Flash Attention during training
  - Use mixed-precision training (bf16)
  - Distribute training across multiple GPUs if available
  - Apply proper gradient clipping
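If `train.py` is built on the Hugging Face Trainer, the tips above map onto a `TrainingArguments` configuration roughly like the following. This is a sketch only; dataset and model loading are assumed to exist elsewhere in `train.py`:

```python
# Sketch: TrainingArguments mirroring the command-line flags and the tips above
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./fine_tuned_model_s1",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=1e-5,
    warmup_steps=500,
    logging_steps=100,
    save_steps=1000,
    evaluation_strategy="steps",
    load_best_model_at_end=True,
    gradient_checkpointing=True,
    bf16=True,                 # mixed precision on H100
    max_grad_norm=1.0,         # gradient clipping
    report_to="wandb",         # Weights & Biases logging
)
```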
License
FULL ACCESS, ENJOY