Visual Question Answering with VQAv2
By the end of this page, you will understand how visual question answering datasets map to Pixano’s data model, and discover how Pixano uniquely combines Q&A annotation with spatial entity annotation in the same workspace.
- What the VQAv2 challenge is and why it matters for multimodal AI
- How VQAv2’s questions, answers, and images map to Pixano’s records, messages, and views
- How Pixano goes beyond pure Q&A by letting you annotate entities with bounding boxes and masks alongside questions
- Key Pixano features: message annotation, VLM integration, and the combination of language and spatial annotation
Background: What is VQAv2?
Visual Question Answering (VQA) is the task of answering natural language questions about an image. VQAv2 (2017) is a large-scale benchmark for this task, containing over 1 million questions on 200,000+ images from MS COCO. It was designed to reduce language bias — each question is paired with two similar images that have different answers, forcing models to ground their reasoning in visual content rather than guessing from the question alone.
- VQAv2 website
- Making the V in VQA Matter (Goyal et al., 2017) — the VQAv2 paper
- VQA: Visual Question Answering (Antol et al., 2015) — the original VQA paper
The VQA problem in Pixano terms
Here is how VQAv2’s structure maps to Pixano:
| VQAv2 concept | Pixano concept | Why |
|---|---|---|
| An image in the dataset | Record | Each image is one sample, with a split and workflow status. |
| The image file | View (Image) | The visual content attached to the record. |
| A question about the image | Message (type QUESTION) | A structured annotation with the question text, type (open/multiple-choice), and optional choices. |
| An answer to the question | Message (type ANSWER) | A separate message linked to the same conversation, containing the answer text. |
| A set of Q&A pairs for one image | Conversation (via conversation_id) | Messages are grouped into conversations by a shared conversation_id. |
But here is where Pixano goes further than a typical VQA tool. A question like “What animal is lying on the couch?” implicitly refers to a specific object in the image. In most VQA datasets, this spatial reference is lost — the answer is just text. Pixano lets you explicitly annotate the objects that questions refer to, using the same entity and annotation model from object detection:
| Additional concept | Pixano concept | Why |
|---|---|---|
| The object a question refers to | Entity | The real-world object (e.g. “the cat on the couch”). We subclass Entity to add category and subcategory. |
| A bounding box around that object | Annotation (BBox) | Spatially locates the entity in the image. |
| A segmentation mask for that object | Annotation (CompressedRLE) | Pixel-level delineation of the entity. |
This combination of language annotation (Q&A messages) and spatial annotation (bounding boxes, masks) in the same dataset is a distinctive feature of Pixano. It lets you build richer datasets where questions are grounded in specific image regions.
Dataset schema
Here is the DatasetInfo used to import a VQAv2 dataset:
```python
from pixano.datasets import DatasetInfo
from pixano.datasets.workspaces import WorkspaceType
from pixano.schemas import BBox, CompressedRLE, Entity, Image, Message, Record


class VQAv2Entity(Entity):
    """An entity referenced by VQA questions."""

    category: str = ""
    subcategory: str = ""


dataset_info = DatasetInfo(
    name="VQAv2 Sample",
    description="VQAv2 visual question answering dataset.",
    workspace=WorkspaceType.IMAGE_VQA,
    record=Record,
    entity=VQAv2Entity,
    message=Message,
    bbox=BBox,
    mask=CompressedRLE,
    views={"image": Image},
)
```

Let's break this down:
- `VQAv2Entity` — A custom entity with `category` (e.g. `"animal"`, `"furniture"`) and `subcategory` (e.g. `"cat"`, `"couch"`). These entities represent the objects that questions refer to.
- `message=Message` — Creates a message table for Q&A pairs. Each `Message` has:
  - `type` — `"QUESTION"` or `"ANSWER"` (or `"SYSTEM"` for instructions)
  - `content` — the text of the question or answer
  - `question_type` — `"OPEN"` for free-form, `"SINGLE_CHOICE"` or `"MULTI_CHOICE"` for multiple-choice
  - `choices` — a list of options (only for multiple-choice questions)
  - `conversation_id` — groups related messages together
  - `number` — the message's position within the conversation
  - `entity_ids` — links the message to specific entities in the image
- `bbox=BBox` and `mask=CompressedRLE` — The original VQAv2 dataset contains only questions and answers. We include bounding boxes and masks because Pixano lets you enrich a VQA dataset with entity annotations. You can draw boxes and masks on the objects that questions refer to, creating a bridge between language understanding and spatial grounding.
- `workspace=WorkspaceType.IMAGE_VQA` — Switches the UI to VQA mode: the image is displayed alongside a conversation panel for managing questions and answers.
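To make these fields concrete, here is an illustrative sketch of one Q&A conversation using plain Python dataclasses. This is *not* the real Pixano `Message` class; the field names and values simply mirror the description above:

```python
from dataclasses import dataclass, field


# Illustrative stand-in for a message record; field names follow the
# schema breakdown above, but this is not the actual Pixano class.
@dataclass
class MessageSketch:
    type: str                    # "QUESTION", "ANSWER", or "SYSTEM"
    content: str                 # question or answer text
    conversation_id: str         # groups messages into one conversation
    number: int                  # position within the conversation
    question_type: str = "OPEN"  # "OPEN", "SINGLE_CHOICE", or "MULTI_CHOICE"
    choices: list[str] = field(default_factory=list)     # multiple-choice options
    entity_ids: list[str] = field(default_factory=list)  # linked entities


# One open-ended Q&A pair, grouped by a shared conversation_id.
question = MessageSketch(
    type="QUESTION",
    content="What animal is lying on the couch?",
    conversation_id="conv_0",
    number=0,
    entity_ids=["entity_cat_1"],  # hypothetical id grounding the question
)
answer = MessageSketch(
    type="ANSWER",
    content="a cat",
    conversation_id="conv_0",
    number=1,
)
```

The key structural point: a question and its answer are separate messages tied together only by `conversation_id` and ordered by `number`.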
Import the dataset
The Pixano Cookbook provides a script that downloads a VQAv2 subset from HuggingFace and generates a Pixano-compatible source folder.
Clone the cookbook and generate the data
```shell
git clone https://github.com/pixano/pixano-cookbook.git
cd pixano-cookbook
pip install datasets pillow
python data_importation/vqav2/generate_sample.py ./vqav2_sample --num-samples 50
```

This produces:

```
vqav2_sample/
  validation/
    000000.jpg
    000001.jpg
    ...
    metadata.jsonl
```

What's inside metadata.jsonl
Each line describes one image with its questions and answers:
```json
{
  "status": "validated",
  "views": { "image": "000000.jpg" },
  "messages": [
    {
      "question": {
        "content": "Where are the kids riding?",
        "question_type": "OPEN"
      },
      "responses": [
        { "content": "carnival ride" }
      ]
    }
  ]
}
```

Notice the structure:

- `"messages"` is a list of Q&A exchanges for this image
- Each exchange has a `"question"` object with `"content"` and `"question_type"`
- `"responses"` is a list of answers — VQAv2 supports multiple answers per question from different annotators
Multiple-choice questions
For multiple-choice questions, set question_type to "SINGLE_CHOICE" or "MULTI_CHOICE" and add a choices list:
```json
{
  "question": {
    "content": "What is in the image?",
    "question_type": "SINGLE_CHOICE",
    "choices": ["a cat", "a dog", "a bird", "a fish"]
  },
  "responses": [{ "content": "a cat" }]
}
```

Run the import
```shell
pixano init ./my_library
pixano data import ./my_library ./vqav2_sample \
  --info data_importation/vqav2/info.py:dataset_info
```

Launch and explore
```shell
pixano server run ./my_library
```

Open http://127.0.0.1:7492 and click the VQAv2 dataset card. When you open an image, the UI displays the image in the center with a VQA conversation panel on the side. The panel lists all questions for the current image, their answers, and controls for adding new Q&A pairs.
Key Pixano features for VQA
Message annotation
Messages are Pixano’s annotation type for conversational data. Unlike bounding boxes or masks which describe where something is, messages describe what is being asked and answered about the image.
Each message has a type (QUESTION or ANSWER) and belongs to a conversation (grouped by conversation_id). Questions can be open-ended (free-form text) or multiple-choice (with a list of options). Multiple answers per question are supported — this is important for VQA evaluation, where the ground truth is typically the consensus across multiple annotators.
The entity_ids field on messages lets you link a question to specific entities in the image, connecting the language annotation to spatial annotations.
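The multiple-answers point is worth making concrete. VQA evaluation commonly uses a consensus metric: a predicted answer counts as fully correct if at least three of the (typically ten) human annotators gave it. The sketch below implements the simplified form of that metric, `min(matches / 3, 1)`; note that the official VQAv2 implementation additionally normalizes answer strings and averages over annotator subsets:

```python
def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    """Simplified VQA consensus accuracy: min(#matching humans / 3, 1).

    Fully correct if at least 3 annotators gave this answer;
    partial credit for 1 or 2 matches.
    """
    matches = sum(1 for a in human_answers if a == predicted)
    return min(matches / 3.0, 1.0)


# Hypothetical annotator answers for one question.
humans = ["cat"] * 6 + ["kitten"] * 3 + ["dog"]
print(vqa_accuracy("cat", humans))     # 1.0 (6 matches, >= 3)
print(vqa_accuracy("kitten", humans))  # 1.0 (exactly 3 matches)
print(vqa_accuracy("dog", humans))     # 0.333... (1 match)
```

This is why storing every annotator's response as a separate answer, rather than collapsing them to one string, preserves the information the metric needs.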
Combining Q&A with entity annotation
This is where Pixano differs from typical VQA annotation tools. Most VQA platforms only handle questions and answers — the visual content is view-only. Pixano treats VQA images as full annotation workspaces.
Because the schema includes BBox and CompressedRLE alongside Message, you can:
- Draw bounding boxes around the objects that questions refer to
- Create segmentation masks for precise spatial grounding
- Link entities to messages via `entity_ids`, explicitly connecting "What animal is on the couch?" to the cat entity with its bounding box
This creates datasets that are richer than pure VQA: each question is not just a text pair but a grounded reference to a specific region of the image. This kind of grounded VQA data is increasingly valuable for training and evaluating multimodal models that need to localize their answers.
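As a minimal sketch of what such a grounded record looks like, here is the question-to-entity link expressed as plain dicts. The field names mirror the schema described above, and the ids and coordinates are hypothetical; this is not the Pixano API:

```python
# Entity table: objects in the image, each with a spatial annotation.
entities = {
    "entity_cat_1": {
        "category": "animal",
        "subcategory": "cat",
        "bbox": [120, 80, 200, 150],  # hypothetical x, y, w, h in pixels
    }
}

# A question message that references an entity via entity_ids.
question = {
    "type": "QUESTION",
    "content": "What animal is on the couch?",
    "entity_ids": ["entity_cat_1"],
}

# Resolving the link turns a text-only question into a grounded one:
# every referenced entity carries its own spatial annotation.
for eid in question["entity_ids"]:
    entity = entities[eid]
    print(f"{question['content']!r} refers to a {entity['subcategory']} "
          f"at bbox {entity['bbox']}")
```

A downstream consumer of this dataset can therefore recover not just what the answer was, but *where* in the image the evidence for it lies.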
VLM integration
Pixano can connect to a Pixano Inference server running vision-language models (VLMs) to assist with annotation. When a compatible model is connected:
- You can generate answers to existing questions — click the star icon next to a question, and the VLM produces an answer based on the image and question text
- You can generate questions — the model proposes questions about the image content
- Annotators can then review and correct the AI-generated outputs
This human-in-the-loop workflow speeds up annotation significantly: instead of writing every question and answer from scratch, annotators review and refine model outputs.
Annotation provenance
As with all annotation types in Pixano, messages carry provenance fields (source_type, source_name, source_metadata). This lets you distinguish:
- Human-written questions and answers from annotators
- Model-generated answers from a connected VLM
- Ground truth answers imported from the original VQAv2 dataset
When combining human and AI annotations in the same dataset, provenance is essential for evaluation and quality control.
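For example, a quality-control pass might separate model-generated answers from human and imported ground-truth ones. The sketch below uses plain dicts with the provenance field names listed above; the specific `source_type` values (`"human"`, `"model"`, `"ground_truth"`) and model name are illustrative assumptions, not Pixano-defined constants:

```python
# Hypothetical messages with provenance fields, as plain dicts.
messages = [
    {"content": "What animal is this?", "source_type": "human",
     "source_name": "annotator_7"},
    {"content": "a cat", "source_type": "model",
     "source_name": "some-vlm"},          # AI-generated draft answer
    {"content": "cat", "source_type": "ground_truth",
     "source_name": "VQAv2"},             # imported original annotation
]

# Route only the model-generated outputs to human review.
model_outputs = [m for m in messages if m["source_type"] == "model"]
print([m["content"] for m in model_outputs])
```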
What you’ve learned
Let’s connect the dots:
- VQAv2 is a visual question answering benchmark where each image is paired with natural language questions and multiple human answers, designed to test whether models truly ground their reasoning in visual content.
- In Pixano, questions and answers are stored as `Message` annotations with types (`QUESTION`, `ANSWER`), grouped by `conversation_id`. The `IMAGE_VQA` workspace provides a dedicated conversation panel alongside the image.
- Pixano's unique strength is combining Q&A with spatial annotation: by including `BBox` and `CompressedRLE` in the schema, you can annotate the entities that questions refer to, creating grounded VQA datasets.
- VLM integration enables a human-in-the-loop workflow where models generate draft questions and answers that annotators review and correct.
Next steps
- Object Detection with Pascal VOC — learn how entity and bbox annotation works in depth
- Video Object Tracking with DAVIS — extend to video with tracklets and temporal masks
- Key Concepts — deeper dive into the data model
- API Reference — full Python API documentation