ScrapeGoatMusic Generation API
A music generation system powered by ScrapeGoatMusic, optimized for NVIDIA H100 GPUs with FastAPI integration.
System Requirements
- NVIDIA H100 GPU
- CUDA 12.0 or higher
- Python 3.8
- 32GB+ RAM
- Ubuntu 22.04 LTS or higher
Installation
- Create and activate a conda environment:
```bash
conda create -n ScrapeGoatMusic python=3.8
conda activate ScrapeGoatMusic
```
- Install PyTorch with CUDA support:
```bash
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
```
- Install dependencies:
```bash
pip install descript-audio-codec
pip install npy_append_array soundfile
pip install fastapi uvicorn python-multipart
pip install flash-attn --no-build-isolation
```
- Clone and install RepCodec:
```bash
cd inference/xcodec_mini_infer
git clone https://github.com/mct10/RepCodec.git
cd RepCodec
pip install .
```
- Download required model files:
```bash
# Download models from Hugging Face
git lfs install
cd inference
git clone https://huggingface.co/scrapegoat/Neural-Audio-Codec
```
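After installation, a quick sanity check of the GPU stack can save debugging time later. The snippet below is optional and only assumes the PyTorch install from above:

```python
# Optional sanity check for the CUDA / bf16 setup installed above
import torch

print(torch.__version__, torch.version.cuda)   # PyTorch and CUDA versions
print(torch.cuda.get_device_name(0))           # should report an NVIDIA H100
print(torch.cuda.is_bf16_supported())          # True on H100 (needed for bf16 inference)
```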
API Setup
- Create a new file `api.py`:
```python
from fastapi import FastAPI, UploadFile, File, Form
from fastapi.responses import FileResponse
import uvicorn
import torch
import os
import argparse
from pathlib import Path
import uuid
from typing import Optional

app = FastAPI(title="ScrapeGoatMusic Generation API")

# Initialize models and configurations
def init_models():
    parser = argparse.ArgumentParser()
    # Add all your existing arguments here
    args = parser.parse_args([])
    args.stage1_model = "scrapegoat/ScrapeGoat-Music-Stage1"
    args.stage2_model = "scrapegoat/ScrapeGoat-Music-Stage2"
    args.max_new_tokens = 3000
    args.run_n_segments = 2
    args.stage2_batch_size = 4
    args.output_dir = "./output"
    args.cuda_idx = 0
    # Add other default arguments
    return args

@app.on_event("startup")
async def startup_event():
    global args
    args = init_models()
    os.makedirs(args.output_dir, exist_ok=True)

@app.post("/generate")
async def generate_music(
    genre_file: UploadFile = File(...),
    lyrics_file: UploadFile = File(...),
    audio_prompt: Optional[UploadFile] = File(None),
    prompt_start_time: float = Form(0.0),
    prompt_end_time: float = Form(30.0)
):
    # Create a unique session ID so concurrent requests do not collide
    session_id = str(uuid.uuid4())
    session_dir = Path(args.output_dir) / session_id
    os.makedirs(session_dir, exist_ok=True)

    # Save uploaded files
    genre_path = session_dir / "genre.txt"
    lyrics_path = session_dir / "lyrics.txt"
    with open(genre_path, "wb") as f:
        f.write(await genre_file.read())
    with open(lyrics_path, "wb") as f:
        f.write(await lyrics_file.read())

    # Handle optional audio prompt
    audio_prompt_path = None
    if audio_prompt:
        audio_prompt_path = session_dir / "audio_prompt.wav"
        with open(audio_prompt_path, "wb") as f:
            f.write(await audio_prompt.read())

    # Run inference
    try:
        # Import your inference code here
        from infer import run_inference
        output_path = run_inference(
            args,
            str(genre_path),
            str(lyrics_path),
            str(audio_prompt_path) if audio_prompt_path else None,
            prompt_start_time,
            prompt_end_time
        )
        return FileResponse(
            output_path,
            media_type="audio/mpeg",
            filename=f"generated_music_{session_id}.mp3"
        )
    except Exception as e:
        return {"error": str(e)}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
```
- Create a new file `infer.py` containing your existing inference code, modified so it can be imported as a module.
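`api.py` above assumes `infer.py` exposes a `run_inference` function with the call signature used in the `/generate` handler. The skeleton below only illustrates that interface; the body is a placeholder for your existing inference pipeline:

```python
# infer.py -- illustrative skeleton only; replace the body with your existing
# Stage 1 / Stage 2 inference code.
from pathlib import Path


def run_inference(args, genre_path, lyrics_path, audio_prompt_path=None,
                  prompt_start_time=0.0, prompt_end_time=30.0):
    """Generate a track and return the path of the rendered MP3 file."""
    output_dir = Path(args.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    # 1. Load the Stage 1 / Stage 2 models (args.stage1_model, args.stage2_model)
    # 2. Read genre tags from genre_path and lyrics from lyrics_path
    # 3. Optionally condition on the [prompt_start_time, prompt_end_time] slice
    #    of the audio at audio_prompt_path
    # 4. Run generation and write the result under output_dir

    output_path = output_dir / "generated.mp3"
    return str(output_path)
```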
Running the API
- Start the API server:
```bash
python api.py
```
- The API will be available at `http://localhost:8000`.
API Endpoints
POST /generate
Generates music based on provided genre and lyrics.
Parameters:
- `genre_file`: Text file containing genre tags (required)
- `lyrics_file`: Text file containing lyrics (required)
- `audio_prompt`: Audio file to use as a prompt (optional)
- `prompt_start_time`: Start time of the audio prompt in seconds (default: 0.0)
- `prompt_end_time`: End time of the audio prompt in seconds (default: 30.0)
Example using curl:
```bash
curl -X POST "http://localhost:8000/generate" \
  -H "accept: application/json" \
  -H "Content-Type: multipart/form-data" \
  -F "genre_file=@/path/to/genre.txt" \
  -F "lyrics_file=@/path/to/lyrics.txt" \
  -F "prompt_start_time=0.0" \
  -F "prompt_end_time=30.0"
```
Example genre.txt format:
```
instrumental pop energetic female vocals
```
Example lyrics.txt format:
```
[verse]
Your lyrics here
[chorus]
Your chorus here
```
H100 Optimization
- Enable Flash Attention:
```python
from transformers import AutoModelForCausalLM
import torch

model = AutoModelForCausalLM.from_pretrained(
    stage1_model,  # e.g. args.stage1_model
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2"
)
```
- Optimize memory usage:
```python
# Add to your inference configuration
torch.cuda.set_device(0)  # Use the first H100
torch.backends.cudnn.benchmark = True
```
- For a multi-GPU setup, modify `cuda_idx` in the API configuration (a placement sketch follows below).
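For reference, a minimal sketch of how `cuda_idx` might be applied when placing the model; the `args` object and `model` variable are assumed to come from `api.py` and your inference code:

```python
# Sketch: place the model on the GPU selected by args.cuda_idx
import torch

device = torch.device(f"cuda:{args.cuda_idx}" if torch.cuda.is_available() else "cpu")
model = model.to(device)
```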
Monitoring
The API serves interactive Swagger documentation at `http://localhost:8000/docs`, which can be used to test and inspect the endpoints.
Troubleshooting
- CUDA Out of Memory:
  - Reduce `stage2_batch_size`
  - Lower `max_new_tokens`
  - Use gradient checkpointing
- Audio Quality Issues:
  - Check the input audio format (16 kHz, mono); see the snippet below for a quick check
  - Verify the genre tags format
  - Ensure lyrics follow the correct structure
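For the audio format check, a small snippet like the following can be used. It relies only on torchaudio, which is installed alongside PyTorch above; the file names are placeholders:

```python
# Convert an audio prompt to 16 kHz mono before uploading it
import torchaudio

waveform, sr = torchaudio.load("audio_prompt.wav")
if waveform.shape[0] > 1:                        # downmix stereo -> mono
    waveform = waveform.mean(dim=0, keepdim=True)
if sr != 16000:                                  # resample to 16 kHz
    waveform = torchaudio.transforms.Resample(sr, 16000)(waveform)
torchaudio.save("audio_prompt_16k.wav", waveform, 16000)
```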
Training
This model was created through a multi-stage training process optimized for music generation. You can further fine-tune the model on your own data using the following steps:
Data Preparation
- Prepare your training data using the provided script:
```bash
python prepare_training_data.py
```
The script expects the following directory structure:
```
training_data/
├── audio_tracks/   # 16kHz mono WAV files
├── lyrics/         # Corresponding lyrics files
└── genres/         # Genre tag files
```
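Before running the script, it is worth confirming that every audio track has matching lyrics and genre files. The sketch below assumes files are matched by stem (e.g. `track001.wav` ↔ `track001.txt`), which may differ from your naming scheme:

```python
# Sketch: verify that each WAV file has corresponding lyrics and genre files
from pathlib import Path

root = Path("training_data")
for wav in sorted((root / "audio_tracks").glob("*.wav")):
    for sub in ("lyrics", "genres"):
        if not (root / sub / f"{wav.stem}.txt").exists():
            print(f"Missing {sub} file for {wav.name}")
```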
Training Requirements
- NVIDIA H100 GPU (recommended)
- 32GB+ GPU memory
- Training dataset with:
- High-quality audio files (16kHz mono)
- Aligned lyrics in structured format
- Genre annotations
- At least 10,000 samples recommended
Fine-tuning Steps
- Install additional training dependencies:
```bash
pip install accelerate datasets transformers
```
- Prepare your configuration:
```bash
# For the Stage 1 model (7B)
export MODEL_PATH="scrapegoat/ScrapeGoat-Music-Stage1"
export OUTPUT_DIR="./fine_tuned_model_s1"

# For the Stage 2 model (1B)
export MODEL_PATH="scrapegoat/ScrapeGoat-Music-Stage2"
export OUTPUT_DIR="./fine_tuned_model_s2"
```
- Start training:
```bash
python train.py \
    --model_name_or_path $MODEL_PATH \
    --output_dir $OUTPUT_DIR \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --learning_rate 1e-5 \
    --warmup_steps 500 \
    --logging_steps 100 \
    --save_steps 1000 \
    --evaluation_strategy steps \
    --load_best_model_at_end \
    --gradient_checkpointing true
```
Training Tips
- Stage 1 Model:
  - Use larger batch sizes (8-16) for better convergence
  - Enable gradient checkpointing for memory efficiency
  - Start with a lower learning rate (1e-5)
  - Train for at least 3 epochs
- Stage 2 Model:
  - Use smaller batch sizes (4-8)
  - A higher learning rate (2e-5) is possible
  - Shorter training time is needed
  - Focus on audio quality metrics
- Monitoring:
  - Use Weights & Biases for training visualization
  - Monitor loss curves for convergence
  - Validate generation quality periodically
  - Check for overfitting on the validation set
- Performance Optimization (see the configuration sketch below):
  - Enable Flash Attention during training
  - Use mixed-precision training (bf16)
  - Distribute training across multiple GPUs if available
  - Apply proper gradient clipping
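If `train.py` is built on the Hugging Face Trainer, the tips above map onto a `TrainingArguments` configuration roughly like the following. This is a sketch only; dataset and model loading are assumed to exist elsewhere in `train.py`:

```python
# Sketch: TrainingArguments mirroring the command-line flags and the tips above
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./fine_tuned_model_s1",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=1e-5,
    warmup_steps=500,
    logging_steps=100,
    save_steps=1000,
    evaluation_strategy="steps",
    load_best_model_at_end=True,
    gradient_checkpointing=True,
    bf16=True,                 # mixed precision on H100
    max_grad_norm=1.0,         # gradient clipping
    report_to="wandb",         # Weights & Biases logging
)
```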
License
FULL ACCESS, ENJOY