Stage 1 Model

ScrapeGoatMusic Generation API

A music generation system powered by ScrapeGoatMusic, optimized for NVIDIA H100 GPUs with FastAPI integration.

System Requirements

  • NVIDIA H100 GPU
  • CUDA 12.0 or higher
  • Python 3.8
  • 32GB+ RAM
  • Ubuntu 22.04 LTS or higher

Installation

  1. Create and activate a conda environment:
conda create -n ScrapeGoatMusic python=3.8
conda activate ScrapeGoatMusic
  2. Install PyTorch with CUDA support:
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
  3. Install dependencies:
pip install descript-audio-codec
pip install npy_append_array soundfile
pip install fastapi uvicorn python-multipart
pip install flash-attn --no-build-isolation
  4. Clone and install RepCodec:
cd inference/xcodec_mini_infer
git clone https://github.com/mct10/RepCodec.git
cd RepCodec
pip install .
  5. Download required model files:
# Download models from Hugging Face
git lfs install
cd inference
git clone https://huggingface.co/scrapegoat/Neural-Audio-Codec
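
Before moving on, it can be worth confirming the environment with a short sanity check such as the one below (a minimal sketch; it only verifies that CUDA is visible and that the packages installed above import cleanly):

# check_env.py - environment sanity check (illustrative)
import torch

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))

# Packages installed above; note that descript-audio-codec imports as "dac"
for pkg in ("dac", "soundfile", "fastapi", "flash_attn"):
    try:
        __import__(pkg)
        print(pkg, "OK")
    except ImportError:
        print(pkg, "NOT FOUND")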

API Setup

  1. Create a new file api.py:
from fastapi import FastAPI, UploadFile, File, Form
from fastapi.responses import FileResponse
import uvicorn
import torch
import os
import argparse
from pathlib import Path
import uuid
from typing import Optional

app = FastAPI(title="ScrapeGoatMusic Generation API")

# Initialize models and configurations
def init_models():
    parser = argparse.ArgumentParser()
    # Add all your existing arguments here
    args = parser.parse_args([])
    args.stage1_model = "scrapegoat/ScrapeGoat-Music-Stage1"
    args.stage2_model = "scrapegoat/ScrapeGoat-Music-Stage2"
    args.max_new_tokens = 3000
    args.run_n_segments = 2
    args.stage2_batch_size = 4
    args.output_dir = "./output"
    args.cuda_idx = 0
    # Add other default arguments
    return args

@app.on_event("startup")
async def startup_event():
    global args
    args = init_models()
    os.makedirs(args.output_dir, exist_ok=True)

@app.post("/generate")
async def generate_music(
    genre_file: UploadFile = File(...),
    lyrics_file: UploadFile = File(...),
    audio_prompt: Optional[UploadFile] = File(None),
    prompt_start_time: float = Form(0.0),
    prompt_end_time: float = Form(30.0)
):
    # Create unique session ID
    session_id = str(uuid.uuid4())
    session_dir = Path(args.output_dir) / session_id
    os.makedirs(session_dir, exist_ok=True)

    # Save uploaded files
    genre_path = session_dir / "genre.txt"
    lyrics_path = session_dir / "lyrics.txt"
    
    with open(genre_path, "wb") as f:
        f.write(await genre_file.read())
    with open(lyrics_path, "wb") as f:
        f.write(await lyrics_file.read())

    # Handle optional audio prompt
    audio_prompt_path = None
    if audio_prompt:
        audio_prompt_path = session_dir / "audio_prompt.wav"
        with open(audio_prompt_path, "wb") as f:
            f.write(await audio_prompt.read())

    # Run inference
    try:
        # Import your inference code here
        from infer import run_inference
        output_path = run_inference(
            args,
            str(genre_path),
            str(lyrics_path),
            str(audio_prompt_path) if audio_prompt_path else None,
            prompt_start_time,
            prompt_end_time
        )
        
        return FileResponse(
            output_path,
            media_type="audio/mpeg",
            filename=f"generated_music_{session_id}.mp3"
        )
    except Exception as e:
        return {"error": str(e)}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
  2. Create a new file infer.py containing your existing inference code, modified so it can be imported as a module.
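
api.py assumes that infer.py exposes a run_inference function roughly like the sketch below. This shows only the expected interface, not the actual inference code; the elided body should wrap your existing stage 1/stage 2 pipeline.

# infer.py - interface expected by api.py (sketch; wrap your existing pipeline here)
from pathlib import Path

def run_inference(args, genre_path, lyrics_path, audio_prompt_path=None,
                  prompt_start_time=0.0, prompt_end_time=30.0):
    """Run stage 1 and stage 2 generation and return the path to the final MP3.

    args              - namespace built by init_models() in api.py
    genre_path        - path to the genre tag text file
    lyrics_path       - path to the structured lyrics text file
    audio_prompt_path - optional reference audio (16kHz mono WAV)
    prompt_start_time / prompt_end_time - prompt segment to use, in seconds
    """
    output_dir = Path(args.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    # ... load the stage 1 model (args.stage1_model) and generate token
    #     sequences from the genre tags and lyrics, optionally conditioned
    #     on the selected audio prompt segment ...
    # ... refine with the stage 2 model (args.stage2_model), decode with the
    #     neural audio codec, and write the result as MP3 ...

    output_path = output_dir / "final_mix.mp3"  # placeholder file name
    return str(output_path)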

Running the API

  1. Start the API server:
python api.py
  2. The API will be available at http://localhost:8000

API Endpoints

POST /generate

Generates music based on provided genre and lyrics.

Parameters:

  • genre_file: Text file containing genre tags (Required)
  • lyrics_file: Text file containing lyrics (Required)
  • audio_prompt: Audio file for prompt (Optional)
  • prompt_start_time: Start time of the audio prompt segment, in seconds (Default: 0.0)
  • prompt_end_time: End time of the audio prompt segment, in seconds (Default: 30.0)

Example using curl:

curl -X POST "http://localhost:8000/generate" \
  -H "accept: application/json" \
  -H "Content-Type: multipart/form-data" \
  -F "genre_file=@/path/to/genre.txt" \
  -F "lyrics_file=@/path/to/lyrics.txt" \
  -F "prompt_start_time=0.0" \
  -F "prompt_end_time=30.0"

Example genre.txt format:

instrumental pop energetic female vocals

Example lyrics.txt format:

[verse]
Your lyrics here
[chorus]
Your chorus here

H100 Optimization

  1. Enable Flash Attention:
from transformers import AutoModelForCausalLM
import torch

model = AutoModelForCausalLM.from_pretrained(
    stage1_model,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2"
)
  2. Optimize memory usage:
# Add to your inference configuration
torch.cuda.set_device(0)  # Use first H100
torch.backends.cudnn.benchmark = True
  3. For multi-GPU setup, modify cuda_idx in the API configuration (see the sketch below).
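
A minimal sketch of what that change amounts to (cuda_idx mirrors the args.cuda_idx field set in api.py):

# sketch: pin the server process to a specific H100 via cuda_idx
import torch

cuda_idx = 1                           # e.g. the second GPU
device = torch.device(f"cuda:{cuda_idx}")
torch.cuda.set_device(device)
# model = model.to(device)             # move the loaded models onto that device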

Monitoring

The API includes Swagger documentation at http://localhost:8000/docs for testing and monitoring endpoints.
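
If a lightweight programmatic check is also wanted, a health endpoint could be added to api.py. This is a hypothetical addition, not part of the code above:

# hypothetical addition to api.py: simple health check for monitoring
@app.get("/health")
async def health():
    return {"status": "ok", "cuda_available": torch.cuda.is_available()}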

Troubleshooting

  1. CUDA Out of Memory:
  • Reduce stage2_batch_size
  • Adjust max_new_tokens
  • Use gradient checkpointing
  2. Audio Quality Issues:
  • Check input audio format (16kHz, mono); see the resampling sketch after this list
  • Verify genre tags format
  • Ensure lyrics follow the correct structure
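
If an input file is not already 16kHz mono, it can be converted with torchaudio (installed alongside PyTorch above); a minimal sketch:

# convert an arbitrary audio file to 16kHz mono WAV (sketch)
import torchaudio

waveform, sample_rate = torchaudio.load("input.wav")         # (channels, samples)
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)             # downmix to mono
if sample_rate != 16000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)
torchaudio.save("audio_prompt.wav", waveform, 16000)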

Training

This model was created through a multi-stage training process optimized for music generation. You can further fine-tune the model on your own data using the following steps:

Data Preparation

  1. Prepare your training data using the provided script:
python prepare_training_data.py

The script expects the following directory structure:

training_data/
├── audio_tracks/      # 16kHz mono WAV files
├── lyrics/            # Corresponding lyrics files
└── genres/            # Genre tag files
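
A quick consistency check before running the script can catch missing files early. The sketch below assumes that audio, lyrics, and genre files share the same file stem, which may differ in your setup:

# validate_data.py - sanity-check the training_data/ layout (sketch)
from pathlib import Path

root = Path("training_data")
audio = {p.stem for p in (root / "audio_tracks").glob("*.wav")}
lyrics = {p.stem for p in (root / "lyrics").glob("*.txt")}
genres = {p.stem for p in (root / "genres").glob("*.txt")}

print(f"{len(audio)} audio tracks found")
print("tracks missing lyrics:", sorted(audio - lyrics)[:10])
print("tracks missing genres:", sorted(audio - genres)[:10])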

Training Requirements

  • NVIDIA H100 GPU (recommended)
  • 32GB+ GPU memory
  • Training dataset with:
    • High-quality audio files (16kHz mono)
    • Aligned lyrics in structured format
    • Genre annotations
    • At least 10,000 samples recommended

Fine-tuning Steps

  1. Install additional training dependencies:
pip install accelerate datasets transformers
  2. Prepare your configuration:
# For Stage 1 model (7B)
export MODEL_PATH="scrapegoat/ScrapeGoat-Music-Stage1"
export OUTPUT_DIR="./fine_tuned_model_s1"

# For Stage 2 model (1B)
export MODEL_PATH="scrapegoat/ScrapeGoat-Music-Stage2"
export OUTPUT_DIR="./fine_tuned_model_s2"
  3. Start training:
python train.py \
    --model_name_or_path $MODEL_PATH \
    --output_dir $OUTPUT_DIR \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --learning_rate 1e-5 \
    --warmup_steps 500 \
    --logging_steps 100 \
    --save_steps 1000 \
    --evaluation_strategy steps \
    --load_best_model_at_end \
    --gradient_checkpointing true
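
train.py itself is not shown here; the command above assumes a script along the lines of the following sketch built on the transformers Trainer, with dataset construction elided:

# train.py - skeleton that accepts the flags used above (sketch)
from dataclasses import dataclass, field
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          HfArgumentParser, Trainer, TrainingArguments)

@dataclass
class ModelArguments:
    model_name_or_path: str = field(default="scrapegoat/ScrapeGoat-Music-Stage1")

def main():
    parser = HfArgumentParser((ModelArguments, TrainingArguments))
    model_args, training_args = parser.parse_args_into_dataclasses()

    tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path)
    model = AutoModelForCausalLM.from_pretrained(model_args.model_name_or_path)

    # TODO: build train_dataset / eval_dataset from the output of
    # prepare_training_data.py; trainer.train() needs a real train_dataset.
    train_dataset = eval_dataset = None

    trainer = Trainer(
        model=model,
        args=training_args,
        tokenizer=tokenizer,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
    )
    trainer.train()
    trainer.save_model(training_args.output_dir)

if __name__ == "__main__":
    main()

With this layout, --model_name_or_path is consumed by the ModelArguments dataclass and every other flag in the command maps directly onto TrainingArguments.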

Training Tips

  1. Stage 1 Model:
  • Use larger batch sizes (8-16) for better convergence
  • Enable gradient checkpointing for memory efficiency
  • Start with a lower learning rate (1e-5)
  • Train for at least 3 epochs
  2. Stage 2 Model:
  • Use smaller batch sizes (4-8)
  • Higher learning rate possible (2e-5)
  • Shorter training time needed
  • Focus on audio quality metrics
  3. Monitoring:
  • Use Weights & Biases for training visualization
  • Monitor loss curves for convergence
  • Validate generation quality periodically
  • Check for overfit on validation set
  4. Performance Optimization (see the configuration sketch after this list):
  • Enable Flash Attention during training
  • Use mixed precision training (bf16)
  • Distribute training across multiple GPUs if available
  • Implement proper gradient clipping
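
The settings from tip 4 map onto standard transformers options roughly as follows (a sketch; the model ID and output directory are the ones used earlier in this README):

# sketch: flash attention, bf16, and gradient clipping for training
import torch
from transformers import AutoModelForCausalLM, TrainingArguments

model = AutoModelForCausalLM.from_pretrained(
    "scrapegoat/ScrapeGoat-Music-Stage1",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

training_args = TrainingArguments(
    output_dir="./fine_tuned_model_s1",
    bf16=True,                    # mixed precision training
    max_grad_norm=1.0,            # gradient clipping
    gradient_checkpointing=True,
)

Multi-GPU training is then usually a matter of launching the same script with accelerate launch or torchrun rather than changing the code.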

License

FULL ACCESS, ENJOY
