Visual Question Answering

Instantiate the model

In this example, we will use the LLaVA-OneVision model, which relies on Qwen2 for its LLM component.

import torch

from pixano_inference.providers import VLLMProvider
from pixano_inference.tasks import MultimodalImageNLPTask

vllm_provider = VLLMProvider()
llava_qwen = vllm_provider.load_model(
    name="llava-qwen",
    task=MultimodalImageNLPTask.CONDITIONAL_GENERATION,
    device=torch.device("cuda"),
    path="llava-hf/llava-onevision-qwen2-0.5b-ov-hf",
    config={
        "dtype": "bfloat16",
    }
)
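
This snippet assumes a CUDA-capable GPU is available. If you prefer to select the device dynamically, a minimal sketch using only PyTorch is shown below (note that vLLM itself generally expects a GPU):

import torch

# Use the GPU when one is available; fall back to CPU otherwise.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")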

Call the model for inference

For VQA inference, call the text_image_conditional_generation method.

In this example, we use a picture of a brown bear and ask the model to provide a detailed description of the image.

from pixano_inference.pydantic import TextImageConditionalGenerationOutput


prompt = [
    {
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://upload.wikimedia.org/wikipedia/commons/9/9e/Ours_brun_parcanimalierpyrenees_1.jpg"
                },
            },
            {"type": "text", "text": "What is displayed in this image ? Answer with high level of description like 1000 words. "},
        ],
        "role": "user",
    }
]

output: TextImageConditionalGenerationOutput = llava_qwen.text_image_conditional_generation(
    prompt=prompt, max_new_tokens=1000
)
output.generated_text
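
The generated_text attribute holds the model's answer as a string. A minimal usage sketch for displaying it:

# Display the model's description of the brown bear image.
print(output.generated_text)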