---
license: apache-2.0
language:
- en
tags:
- music
- text-generation
- transformers
pipeline_tag: text-generation
library_name: transformers
---

# Stage 2 Model

# ScrapeGoatMusic Generation API

A music generation system powered by ScrapeGoatMusic, optimized for NVIDIA H100 GPUs with FastAPI integration.

## System Requirements

- NVIDIA H100 GPU
- CUDA 12.0 or higher
- Python 3.8
- 32GB+ RAM
- Ubuntu 22.04 LTS or higher

## Installation

1. Create and activate a conda environment:

```bash
conda create -n ScrapeGoatMusic python=3.8
conda activate ScrapeGoatMusic
```

2. Install PyTorch with CUDA support:

```bash
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
```

3. Install dependencies:

```bash
pip install descript-audio-codec
pip install npy_append_array soundfile
pip install fastapi uvicorn python-multipart
pip install flash-attn --no-build-isolation
```

4. Clone and install RepCodec:

```bash
cd inference/xcodec_mini_infer
git clone https://github.com/mct10/RepCodec.git
cd RepCodec
pip install .
```

5. Download required model files:

```bash
# Download models from Hugging Face
git lfs install
cd inference
git clone https://huggingface.co/Nathan9/xcodec_mini_infer
```

## API Setup

1. Create a new file `api.py`:

```python
from fastapi import FastAPI, UploadFile, File, Form
from fastapi.responses import FileResponse, JSONResponse
import uvicorn
import torch
import os
import argparse
from pathlib import Path
import uuid
from typing import Optional

app = FastAPI(title="ScrapeGoatMusic Generation API")

# Initialize models and configurations
def init_models():
    parser = argparse.ArgumentParser()
    # Add all your existing arguments here
    args = parser.parse_args([])
    args.stage1_model = "Nathan9/ScrapeGoatMusic-s1-7B-anneal-en-cot"
    args.stage2_model = "Nathan9/ScrapeGoatMusic-s2-1B-general"
    args.max_new_tokens = 3000
    args.run_n_segments = 2
    args.stage2_batch_size = 4
    args.output_dir = "./output"
    args.cuda_idx = 0
    # Add other default arguments
    return args

@app.on_event("startup")
async def startup_event():
    global args
    args = init_models()
    os.makedirs(args.output_dir, exist_ok=True)

@app.post("/generate")
async def generate_music(
    genre_file: UploadFile = File(...),
    lyrics_file: UploadFile = File(...),
    audio_prompt: Optional[UploadFile] = File(None),
    prompt_start_time: float = Form(0.0),
    prompt_end_time: float = Form(30.0)
):
    # Create unique session ID
    session_id = str(uuid.uuid4())
    session_dir = Path(args.output_dir) / session_id
    os.makedirs(session_dir, exist_ok=True)

    # Save uploaded files
    genre_path = session_dir / "genre.txt"
    lyrics_path = session_dir / "lyrics.txt"

    with open(genre_path, "wb") as f:
        f.write(await genre_file.read())
    with open(lyrics_path, "wb") as f:
        f.write(await lyrics_file.read())

    # Handle optional audio prompt
    audio_prompt_path = None
    if audio_prompt:
        audio_prompt_path = session_dir / "audio_prompt.wav"
        with open(audio_prompt_path, "wb") as f:
            f.write(await audio_prompt.read())

    # Run inference
    try:
        # Import your inference code here
        from infer import run_inference

        output_path = run_inference(
            args,
            str(genre_path),
            str(lyrics_path),
            str(audio_prompt_path) if audio_prompt_path else None,
            prompt_start_time,
            prompt_end_time
        )

        return FileResponse(
            output_path,
            media_type="audio/mpeg",
            filename=f"generated_music_{session_id}.mp3"
        )
    except Exception as e:
        # Report failures with an error status instead of a 200 response
        return JSONResponse(status_code=500, content={"error": str(e)})

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
```

2. Create a new file `infer.py` containing your existing inference code, modified so that it can be imported as a module and exposes the `run_inference` function called by `api.py`.

## Running the API

1. Start the API server:

```bash
python api.py
```

2. The API will be available at `http://localhost:8000`

## API Endpoints

### POST /generate

Generates music based on the provided genre tags and lyrics. On success, the response is the generated audio returned as an MP3 file.

**Parameters:**

- `genre_file`: Text file containing genre tags (Required)
- `lyrics_file`: Text file containing lyrics (Required)
- `audio_prompt`: Audio file for prompt (Optional)
- `prompt_start_time`: Start time for audio prompt (Default: 0.0)
- `prompt_end_time`: End time for audio prompt (Default: 30.0)

**Example using curl:**

```bash
curl -X POST "http://localhost:8000/generate" \
  -H "accept: application/json" \
  -H "Content-Type: multipart/form-data" \
  -F "genre_file=@/path/to/genre.txt" \
  -F "lyrics_file=@/path/to/lyrics.txt" \
  -F "prompt_start_time=0.0" \
  -F "prompt_end_time=30.0"
```

**Example genre.txt format:**

```
instrumental pop energetic female vocals
```

**Example lyrics.txt format:**

```
[verse]
Your lyrics here

[chorus]
Your chorus here
```

## H100 Optimization

1. Enable Flash Attention:

```python
import torch
from transformers import AutoModelForCausalLM

stage1_model = "Nathan9/ScrapeGoatMusic-s1-7B-anneal-en-cot"

model = AutoModelForCausalLM.from_pretrained(
    stage1_model,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2"
)
```

2. Select the GPU and enable cuDNN autotuning:

```python
# Add to your inference configuration
torch.cuda.set_device(0)  # Use first H100
torch.backends.cudnn.benchmark = True
```

3. For multi-GPU setup, modify `cuda_idx` in the API configuration.

## Monitoring

The API serves Swagger documentation at `http://localhost:8000/docs` for exploring and testing endpoints.

## Troubleshooting

1. CUDA Out of Memory:
   - Reduce `stage2_batch_size`
   - Lower `max_new_tokens`
   - Use gradient checkpointing (relevant when fine-tuning; it does not reduce inference memory)

2. Audio Quality Issues:
   - Check input audio format (16kHz, mono)
   - Verify genre tags format
   - Ensure lyrics follow the correct structure

## Training

This model was created through a multi-stage training process optimized for music generation. You can further fine-tune the model on your own data using the following steps:

### Data Preparation

1. Prepare your training data using the provided script:

```bash
python prepare_training_data.py
```

The script expects the following directory structure:

```
training_data/
├── audio_tracks/    # 16kHz mono WAV files
├── lyrics/          # Corresponding lyrics files
└── genres/          # Genre tag files
```

### Training Requirements

- NVIDIA H100 GPU (recommended)
- 32GB+ GPU memory
- Training dataset with:
  - High-quality audio files (16kHz mono)
  - Aligned lyrics in structured format
  - Genre annotations
  - At least 10,000 samples recommended

### Fine-tuning Steps

1. Install additional training dependencies:

```bash
pip install accelerate datasets transformers
```

2. Prepare your configuration:

```bash
# Export only ONE of the two pairs below; if both run, the second overrides the first.

# For Stage 1 model (7B)
export MODEL_PATH="Nathan9/ScrapeGoatMusic-s1-7B-anneal-en-cot"
export OUTPUT_DIR="./fine_tuned_model_s1"

# For Stage 2 model (1B)
export MODEL_PATH="Nathan9/ScrapeGoatMusic-s2-1B-general"
export OUTPUT_DIR="./fine_tuned_model_s2"
```

3. Start training:

```bash
python train.py \
  --model_name_or_path "$MODEL_PATH" \
  --output_dir "$OUTPUT_DIR" \
  --num_train_epochs 3 \
  --per_device_train_batch_size 4 \
  --gradient_accumulation_steps 4 \
  --learning_rate 1e-5 \
  --warmup_steps 500 \
  --logging_steps 100 \
  --save_steps 1000 \
  --evaluation_strategy steps \
  --load_best_model_at_end \
  --gradient_checkpointing true
```

### Training Tips

1. Stage 1 Model:
   - Use larger batch sizes (8-16) for better convergence
   - Enable gradient checkpointing for memory efficiency
   - Start with a lower learning rate (1e-5)
   - Train for at least 3 epochs

2. Stage 2 Model:
   - Use smaller batch sizes (4-8)
   - A higher learning rate is possible (2e-5)
   - Less training time is needed
   - Focus on audio quality metrics

3. Monitoring:
   - Use Weights & Biases for training visualization
   - Monitor loss curves for convergence
   - Validate generation quality periodically
   - Check for overfitting on the validation set

4. Performance Optimization:
   - Enable Flash Attention during training
   - Use mixed precision training (bf16)
   - Distribute training across multiple GPUs if available
   - Apply gradient clipping

## License

FULL ACCESS, ENJOY
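
The troubleshooting advice to ensure lyrics follow the correct structure can be checked before uploading a file to `POST /generate`. The following is a minimal sketch of such a check, assuming that a lyrics file consists of bracketed section tags such as `[verse]` and `[chorus]`, each on its own line and followed by at least one lyric line. The function name `parse_lyrics` and the accepted tag pattern are hypothetical and are not part of the ScrapeGoatMusic code; the set of tags the model actually accepts is not specified here.

```python
import re
from typing import List, Tuple

# Hypothetical pattern: a line made up only of a bracketed tag, e.g. "[verse]".
SECTION_TAG = re.compile(r"^\[([A-Za-z][A-Za-z0-9 _-]*)\]$")


def parse_lyrics(text: str) -> List[Tuple[str, List[str]]]:
    """Split lyrics text into (tag, lyric_lines) sections.

    Raises ValueError if no section tag is found, if lyric lines appear
    before the first tag, or if a section has no lyric lines.
    """
    sections: List[Tuple[str, List[str]]] = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # blank lines only separate sections
        match = SECTION_TAG.match(line)
        if match:
            # Start a new section, normalizing the tag to lowercase
            sections.append((match.group(1).lower(), []))
        elif not sections:
            raise ValueError(f"line {lineno}: lyrics before any [section] tag")
        else:
            sections[-1][1].append(line)

    if not sections:
        raise ValueError("no [section] tags found")
    empty = [tag for tag, lines in sections if not lines]
    if empty:
        raise ValueError(f"sections without lyrics: {empty}")
    return sections


if __name__ == "__main__":
    example = "[verse]\nYour lyrics here\n\n[chorus]\nYour chorus here\n"
    for tag, lines in parse_lyrics(example):
        print(tag, lines)
```

Running a check like this on the client side surfaces malformed input immediately, instead of after a full generation request has been sent to the server.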