Extract text from images using OCR
Combine text and images to generate responses
Transcribe audio from a microphone, a file, or a YouTube link
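
The transcription feature accepts three kinds of input, so some routing step has to decide which one it was given. The sketch below shows one way that decision might look. It is only an illustration: the function name `classify_audio_source`, the `YOUTUBE_HOSTS` set, and the convention that an empty input means the microphone are all assumptions, not part of the tool described here.

```python
from urllib.parse import urlparse

# Hypothetical set of hostnames treated as YouTube links (an assumption).
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "youtu.be"}


def classify_audio_source(source=None):
    """Return 'microphone', 'youtube', or 'file' for the given input.

    Assumption: None or an empty string means live microphone capture.
    """
    if not source:
        return "microphone"

    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        # Only YouTube URLs are accepted; other web URLs are rejected.
        if parsed.hostname in YOUTUBE_HOSTS:
            return "youtube"
        raise ValueError(f"Unsupported URL: {source!r}")

    # Anything that is not a web URL is treated as a local file path.
    return "file"
```

A caller could then dispatch on the returned label, for example downloading the audio track first when the result is `"youtube"` and opening the path directly when it is `"file"`.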