Prompt from Image Pipeline?

#48
by HDHunter - opened

Thank you all very much for the very impressive work! I am very new to all of this, so forgive me for the very newb question!

I am having trouble coming up with effective prompts. Would it be possible to build a pipeline that lets me pass an image into the CLIP encoder and have the text encoding pipeline generate a prompt for me? I have read that CLIP encoders can be used for image captioning, and my thinking was that maybe Flux.1 could teach me how to write more effective prompts.

FLUX doesn't do text-to-text or 'image + text' --> 'text'. The best I can think of is to use a vision model: tell it what your prompt was and what you would like to change (ChatGPT might be able to do this, even for free users). But I don't think that would help very much, because FLUX already understands long descriptions. Even if FLUX could generate text, it's still just 12B parameters, and 12B-parameter text-to-text models aren't too bright.

You might want to look into image to image workflows using ComfyUI for tweaking images. I've never touched it though.

If FLUX didn't take up so much VRAM, I'd say that you might want to try running your prompt through an LLM, then have the output of the text model be the input for FLUX.

I have a specific prompt creation function because I'm uncreative and wanted to see some cool generations:

# Python
import random
from typing import List

def generate_prompts(num_prompts: int = 1) -> List[str]:
    """Return num_prompts random image prompts built from the word lists below."""
    themes: List[str] = [
        "a peaceful mountaintop with distant cities glimmering under starlight",
        "a chaotic urban escape scene as a giant monster attacks",
        "a romantic Venetian gondola ride under the moonlight",
        "a heroic battle on an alien planet against invaders",
        "a surreal otherworldly celebration with floating lanterns",
        "the inside of a massive clock tower with intricate gears and mechanisms",
        "a quiet cemetery with ethereal spirits rising at dusk",
        "a grand coronation ceremony in an opulent palace",
        "a devastating volcanic eruption viewed from a safe distance",
        "a bustling medieval market with traders from distant lands",
        "a serene Buddhist temple during a cherry blossom festival",
        "an intense car chase through a rain-soaked futuristic city",
        "a tranquil underwater scene with mermaids and mythical sea creatures",
        "a gripping spacewalk repair mission on a damaged spacecraft",
        "a historic moment of peace treaty signing in a war-torn land",
        "a wild safari adventure with close encounters with majestic animals",
        "a haunted house exploration with paranormal investigators",
        "an epic gladiator fight in an ancient Roman arena",
        "a deep-sea dive into a sunken pirate ship surrounded by sharks",
        "a high-stakes poker game in a smoky, dimly lit room",
        "a majestic view of the Northern Lights from a snowy tundra",
        "an apocalypse survival scenario in a desolate urban wasteland",
        "a fantasy dragon's lair with treasures and sleeping dragon",
        "a cosmic event with a comet passing closely to Earth",
        "an ancient Egyptian ritual ceremony by the pyramids at sunset",
        "a moonlit dance on a secluded beach",
        "a candlelit dinner in a secluded forest clearing",
        "a passionate reunion at a bustling train station",
        "a first kiss under a rain-drenched awning",
        "a proposal at the top of a Ferris wheel"
    ]

    styles: List[str] = [
        "Renaissance-inspired", "Baroque-influenced", "Impressionist", "Modernist", "Abstract",
        "Cubist", "Futuristic", "Deco-inspired", "Gothic", "Surrealist", "Expressionist", "Digital Glitch",
        "Pop Art", "Minimalist", "Post-Impressionist", "Neo-Expressionist", "Photorealistic", "Art Nouveau",
        "Constructivist", "Dada-inspired", "Romantic", "Victorian"
    ]

    events: List[str] = [
        "a serene landscape", "a dynamic action scene", "a tender romantic moment", "an epic heroic showdown",
        "a bizarre and fantastical event", "a historical reenactment", "a science fiction encounter", "a mystical and magical event",
        "a moment of everyday life", "a climactic battle", "a discovery of hidden treasures", "a journey through a dream",
        "a celebration of a cultural festival", "an apocalyptic vision", "a sports event in action", "a reunion of long-lost friends",
        "a quiet reflective moment", "a supernatural occurrence", "a nature documentary scene", "a bustling urban environment",
        "an intimate conversation in a quiet café", "a slow dance in the rain", "a heartwarming scene of homecoming",
        "a surprise romantic getaway", "a playful day at the park"
    ]

    characters: List[str] = [
        "featuring a central figure",
        "with a group of characters",
        "highlighting an iconic figure",
        "without any visible characters" 
    ]

    times_of_day: List[str] = ["at the crack of dawn", "in the blazing afternoon", "at dusk", "in the dead of night"]
    weather_conditions: List[str] = ["during a fierce storm", "on a calm sunny day", "in a foggy setting", "while it snows heavily"]
    emotions: List[str] = ["calm", "tense", "joyful", "terrifying", "mystical"]
    adjectives: List[str] = ["ancient", "modern", "decayed", "flourishing", "mythical", "lonely", "crowded", "dramatic"]

    actions_with_characters: List[str] = ["resting", "celebrating", "escaping", "fighting", "discovering"]
    actions_without_characters: List[str] = ["shrouded in mystery", "illuminated by natural phenomena", "transformed by time", "engulfed in natural beauty", "caught in a temporal standstill"]

    def generate_prompt() -> str:
        theme = random.choice(themes)
        style = random.choice(styles)
        event_type = random.choice(events)
        character_option = random.choice(characters)
        time_of_day = random.choice(times_of_day)
        weather_condition = random.choice(weather_conditions)
        emotion = random.choice(emotions)
        adjective = random.choice(adjectives)
        
        # Scenes without characters get an atmospheric description instead of an action.
        if "without any visible characters" in character_option:
            action = random.choice(actions_without_characters)
            prompt = f"Create a {style} {event_type} image {character_option}. This {adjective} scene, {action}, occurs in {theme} {time_of_day}, {weather_condition}. The overall mood is {emotion}."
        else:
            action = random.choice(actions_with_characters)
            prompt = f"Create a {style} {event_type} image {character_option}. This {adjective} scene, where characters are {action}, takes place in {theme} {time_of_day}, {weather_condition}. The overall mood is {emotion}."

        return prompt

    prompts: List[str] = [generate_prompt() for _ in range(num_prompts)]
    return prompts

Generate prompt function sample output:
Prompt 1: Create a Futuristic a moment of everyday life image with a group of characters. This modern scene, where characters are celebrating, takes place in an ancient Egyptian ritual ceremony by the pyramids at sunset at the crack of dawn, on a calm sunny day. The overall mood is calm.
Prompt 2: Create a Baroque-influenced an apocalyptic vision image with a group of characters. This flourishing scene, where characters are resting, takes place in a grand coronation ceremony in an opulent palace at dusk, in a foggy setting. The overall mood is mystical.
Prompt 3: Create a Art Nouveau a heartwarming scene of homecoming image featuring a central figure. This decayed scene, where characters are escaping, takes place in a tranquil underwater scene with mermaids and mythical sea creatures at dusk, while it snows heavily. The overall mood is tense.
Prompt 4: Create a Deco-inspired a serene landscape image without any visible characters. This flourishing scene, caught in a temporal standstill, occurs in a bustling medieval market with traders from distant lands at dusk, while it snows heavily. The overall mood is mystical.
Prompt 5: Create a Photorealistic a heartwarming scene of homecoming image with a group of characters. This flourishing scene, where characters are discovering, takes place in a peaceful mountaintop with distant cities glimmering under starlight in the dead of night, on a calm sunny day. The overall mood is tense.
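The sample prompts above contain doubled or mismatched articles ("a Futuristic a moment of everyday life image", "a Art Nouveau ..."). The images still came out well, so this is cosmetic, but a small tweak to the opening sentence could read more naturally. The sketch below is not part of the original function; `with_article` and `opening_sentence` are hypothetical helper names, and the vowel check is a rough heuristic rather than a full English rule.

```python
def with_article(phrase: str) -> str:
    """Prefix a phrase with 'a' or 'an' based on its first letter (rough heuristic)."""
    starts_with_vowel = bool(phrase) and phrase[0].lower() in "aeiou"
    article = "an" if starts_with_vowel else "a"
    return f"{article} {phrase}"


def opening_sentence(style: str, event_type: str, character_option: str) -> str:
    """Build the first sentence of a prompt without stacking two articles.

    The entries in the events list already start with 'a' or 'an', so the
    style is attached to the word 'image' and the event follows after 'of'.
    """
    return f"Create {with_article(style)} image of {event_type} {character_option}."


print(opening_sentence("Art Nouveau",
                       "a heartwarming scene of homecoming",
                       "featuring a central figure"))
```

Swapping this in for the first sentence of both f-strings in `generate_prompt()` would leave the rest of each prompt unchanged.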

@DrSpyderNerd Thank you very much for your function! I look forward to trying it out more. The first prompt I ran through Flux.1 produced very impressive results!

InternVL2 appears to be the highest rated VL model at the moment, and I ran a few sample images through it with the prompt from the LAION-Pop blog article https://laion.ai/blog/laion-pop/. The Pro (108B parameter) model seems to produce very nice descriptions. Sadly, the official demo is very busy and not dependable at all. I'd need a checkpoint that I could run locally on a free Colab lol! I can't remember now whether I tried the 26B or the 8B parameter model, but sadly its descriptions were far less effective as Flux.1 prompts. I still need to run more samples!

@DrSpyderNerd I was watching some YouTube videos yesterday and discovered Florence-2-large. At only 0.77B parameters it is surprisingly good at image captioning/describing! I do not recommend the DETAILED_CAPTION mode though.
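For anyone who wants to try Florence-2 captioning locally, here is a minimal sketch that follows the usage pattern on the model card as best I recall it. Treat it as a starting point under assumptions, not a verified recipe: `florence_caption` is a hypothetical helper name, the default task token and the `max_new_tokens`/`num_beams` values are my own choices (the post only advises against `<DETAILED_CAPTION>`), and the model's remote code may need extra packages depending on your `transformers` version.

```python
from typing import Tuple

MODEL_ID = "microsoft/Florence-2-large"
# Caption task tokens that Florence-2 accepts as its text prompt.
CAPTION_TASKS: Tuple[str, ...] = ("<CAPTION>", "<DETAILED_CAPTION>", "<MORE_DETAILED_CAPTION>")


def florence_caption(image_path: str, task: str = "<MORE_DETAILED_CAPTION>") -> str:
    """Caption one image with Florence-2 and return the text (sketch only)."""
    if task not in CAPTION_TASKS:
        raise ValueError(f"task must be one of {CAPTION_TASKS}, got {task!r}")

    # Heavy imports are done lazily so the check above works without them installed.
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoProcessor

    # Florence-2 ships custom modeling code, hence trust_remote_code=True.
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, trust_remote_code=True)
    processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)

    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=task, images=image, return_tensors="pt")

    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=256,  # assumed budget; raise it for longer captions
        num_beams=3,         # assumed value
    )
    raw_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

    # post_process_generation returns a dict keyed by the task token.
    parsed = processor.post_process_generation(
        raw_text, task=task, image_size=(image.width, image.height)
    )
    return parsed[task]
```

Calling `florence_caption("my_image.png")` (a placeholder path) should return a description that can be pasted into Flux.1 as a prompt. The model is loaded inside the function for brevity; for many images, load it once and reuse it.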

Samples generated by generate_prompt():

Create a Constructivist a reunion of long-lost friends image featuring a central figure. This lonely scene, where characters are celebrating, takes place in a serene Buddhist temple during a cherry blossom festival at the crack of dawn, during a fierce storm. The overall mood is calm.

34_1179187037.webp

Create a Gothic a historical reenactment image with a group of characters. This mythical scene, where characters are fighting, takes place in a surreal otherworldly celebration with floating lanterns at the crack of dawn, on a calm sunny day. The overall mood is tense.

33_1503255876.webp

Out of the box, openai/clip-vit-large-patch14 can only do zero-shot image classification: you pass in an image together with candidate labels and ask, for example, whether it is an image of a cat or an image of a dog. It is rather quick and runs really well on CPU, so if you wanted to, say, censor the inputs to your image-to-image workflow, it might make a simple safety checker. T5 seems to be superior to GPT-2 in most respects, so it would probably be possible to develop a transformer that maps the embeddings returned from CLIP into inputs for the T5 model to create a simple caption, sort of like they did in ClipCap. Building transformers is well beyond my current ability, so I guess I will have to put this on hold.

https://medium.com/@uppalamukesh/clipcap-clip-prefix-for-image-captioning-3970c73573bc
https://arxiv.org/pdf/2201.12723
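The cat-or-dog classification described above can be sketched with the `transformers` CLIP classes. This is a rough illustration, not a production safety checker: `clip_label_probs` and `is_blocked` are hypothetical helper names, and the 0.5 threshold is an arbitrary assumption that would need tuning on real data.

```python
from typing import Dict, List

MODEL_ID = "openai/clip-vit-large-patch14"  # checkpoint named in the post


def clip_label_probs(image_path: str, labels: List[str]) -> Dict[str, float]:
    """Score one image against candidate text labels with CLIP (zero-shot)."""
    # Imported lazily so is_blocked() below can be used without torch installed.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained(MODEL_ID)
    processor = CLIPProcessor.from_pretrained(MODEL_ID)

    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)

    # logits_per_image has shape (1, len(labels)); softmax turns the
    # image-text similarities into probabilities over the given labels.
    probs = outputs.logits_per_image.softmax(dim=1)[0]
    return {label: float(p) for label, p in zip(labels, probs)}


def is_blocked(probs: Dict[str, float], blocked: List[str], threshold: float = 0.5) -> bool:
    """Hypothetical filter rule: reject if any blocked label scores above threshold."""
    return any(probs.get(label, 0.0) > threshold for label in blocked)
```

For example, `clip_label_probs("input.png", ["a photo of a cat", "a photo of a dog"])` (with a placeholder path) returns one probability per label, and `is_blocked(...)` turns those scores into a yes/no decision. Note that the probabilities are relative to the labels you supply, so the label list should include a neutral alternative.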

Thanks again for the great work on flux and to @DrSpyderNerd for sharing his random prompt generator. I was actually quite impressed with the images Flux.1 generated from it!

HDHunter changed discussion status to closed
