Object Detection with Pascal VOC

By the end of this page, you will understand how an object detection dataset maps to Pixano’s data model and be ready to annotate images with bounding boxes and segmentation masks.

What you'll learn
  • What the Pascal VOC detection challenge is and why it matters
  • How VOC’s annotations map to Pixano’s records, entities, and annotations
  • How to import a VOC dataset using the Pixano Cookbook
  • Key Pixano features for object detection: bbox drawing, interactive segmentation, and annotation provenance

Background: What is Pascal VOC?

The PASCAL Visual Object Classes (VOC) challenge was one of the foundational benchmarks in computer vision, running from 2005 to 2012. It asked a simple but powerful question: given a natural image, can a system identify and localize every object in it?

The VOC 2007 dataset contains images sourced from Flickr, annotated with 20 object categories: person, bird, cat, cow, dog, horse, sheep, aeroplane, bicycle, boat, bus, car, motorbike, train, bottle, chair, dining table, potted plant, sofa, and TV/monitor. Each object is annotated with a bounding box (a rectangle around the object) and a category label.

VOC shaped how the computer vision community thinks about detection evaluation. It introduced key concepts still used today:

  • IoU (Intersection over Union) — a metric that measures how well a predicted bounding box overlaps with the ground truth
  • mAP (mean Average Precision) — the standard metric for evaluating detection models, computed across all categories
  • Difficult flag — some objects are marked as “difficult” (heavily occluded, very small, etc.) and excluded from evaluation
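IoU is simple enough to sketch in a few lines. The helper below is illustrative (not part of Pixano or the VOC toolkit) and computes IoU for two boxes in the normalized [x, y, w, h] format used later on this page:

```python
def iou(box_a, box_b):
    """IoU of two boxes in [x, y, w, h] format (any consistent units)."""
    ax1, ay1, aw, ah = box_a
    bx1, by1, bw, bh = box_b
    ax2, ay2 = ax1 + aw, ay1 + ah
    bx2, by2 = bx1 + bw, by1 + bh
    # Width and height of the intersection rectangle (clamped at zero
    # so non-overlapping boxes yield an empty intersection).
    ix = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    iy = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0
```

For example, two 2×2 boxes offset by one unit in each direction share a 1×1 intersection and a union of 7, giving an IoU of 1/7. VOC's detection evaluation counted a prediction as correct when its IoU with a ground-truth box exceeded 0.5.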

The detection problem in Pixano terms

To work with a VOC dataset in Pixano, we need to map its structure to Pixano’s data model:

  • An image in the dataset → Record. Each image is one sample; the record holds its split (train, val) and workflow status.
  • The image file itself → View (Image). The media attached to the record, named "image" in this dataset.
  • An annotated object (e.g. “car”) → Entity. Entities represent the real-world objects you want to label; we subclass Entity to add category and is_difficult.
  • The rectangle around the object → Annotation (BBox). The geometric label: normalized bounding box coordinates.

This separation is intentional. By keeping records, views, entities, and annotations in separate tables, Pixano makes it easy to attach multiple annotation types to the same entity — for instance a bounding box and a segmentation mask — and to track who created each annotation.

Dataset schema

Here is the DatasetInfo used to import a VOC dataset into Pixano:

from pixano.datasets import DatasetInfo
from pixano.datasets.workspaces import WorkspaceType
from pixano.schemas import BBox, CompressedRLE, Entity, Image, Record


class VOCEntity(Entity):
    """An object detected in a VOC image."""

    category: str = ""
    is_difficult: bool = False


dataset_info = DatasetInfo(
    name="VOC 2007 Sample",
    description="Pascal VOC 2007 object detection dataset.",
    workspace=WorkspaceType.IMAGE,
    record=Record,
    entity=VOCEntity,
    bbox=BBox,
    mask=CompressedRLE,
    views={"image": Image},
)

Let’s look at each piece:

  • VOCEntity — A custom subclass of Entity with two domain-specific fields. category stores the object class (e.g. "car", "person"). is_difficult mirrors VOC’s flag for objects that are hard to recognize.

  • bbox=BBox — Creates a bounding box table. This is where the rectangular coordinates for each entity will be stored. VOC annotations use normalized coordinates in [x, y, w, h] format, which Pixano handles natively.

  • mask=CompressedRLE — Creates a segmentation mask table using compressed run-length encoding. The original VOC dataset only has bounding boxes — we add masks here to show that Pixano supports multiple annotation types per entity. This also enables AI-assisted segmentation tools like SAM in the annotation workspace.

  • views={"image": Image} — Each record has one view named "image".

  • workspace=WorkspaceType.IMAGE — Tells the UI to use the single-image viewer with annotation tools.
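VOC's original XML annotations store pixel-corner coordinates (xmin, ymin, xmax, ymax), so importing into the normalized [x, y, w, h] format involves a small conversion. This helper is an illustrative sketch, not part of Pixano's API; note that original VOC XML coordinates are 1-based, which this sketch ignores for simplicity:

```python
def voc_to_normalized(xmin, ymin, xmax, ymax, img_w, img_h):
    """Convert VOC pixel corners to a normalized [x, y, w, h] box in [0, 1].

    Treats coordinates as 0-based; original VOC XML is 1-based.
    """
    return [
        xmin / img_w,
        ymin / img_h,
        (xmax - xmin) / img_w,
        (ymax - ymin) / img_h,
    ]
```

For a 500×500 image, a box spanning pixels (0, 0) to (250, 250) becomes [0.0, 0.0, 0.5, 0.5].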

Import the dataset

The Pixano Cookbook provides a ready-to-run script that downloads Pascal VOC 2007, samples a subset, converts the XML annotations to Pixano’s metadata.jsonl format, and organizes everything into a source folder.

Clone the cookbook and generate the data

git clone https://github.com/pixano/pixano-cookbook.git
cd pixano-cookbook
pip install pillow
python data_importation/voc/generate_sample.py ./voc_sample --num-samples 50

This produces a source folder organized by splits:

voc_sample/
  train/
    000032.jpg
    000045.jpg
    ...
    metadata.jsonl
  validation/
    000007.jpg
    ...
    metadata.jsonl

What’s inside metadata.jsonl

Each line in metadata.jsonl describes one image and its annotations:

{
  "status": "validated",
  "views": { "image": "000032.jpg" },
  "entities": [
    {
      "category": "aeroplane",
      "is_difficult": false,
      "annotations": {
        "image": {
          "bbox": [0.078, 0.09, 0.756, 0.792]
        }
      }
    }
  ]
}

Notice the structure:

  • "views" maps view names to filenames — Pixano uses this to find the image
  • "entities" is a list of objects, each with the custom fields from VOCEntity
  • "annotations" is nested per view — the "image" key matches the view name defined in DatasetInfo, and "bbox" matches the annotation slot. Coordinates are normalized [x, y, w, h] values in [0, 1].
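If you generate metadata.jsonl yourself, each line is just a JSON object, so the standard json module is all you need. The make_record helper below is hypothetical (not a Pixano API); it simply mirrors the field layout shown above:

```python
import json


def make_record(filename, objects, status="validated"):
    """Build one metadata.jsonl entry for an image and its objects.

    Hypothetical helper for illustration; `objects` is a list of dicts
    with "category" and a normalized "bbox" in [x, y, w, h] format.
    """
    return {
        "status": status,
        "views": {"image": filename},
        "entities": [
            {
                "category": obj["category"],
                "is_difficult": obj.get("is_difficult", False),
                "annotations": {"image": {"bbox": obj["bbox"]}},
            }
            for obj in objects
        ],
    }


# One line of metadata.jsonl, matching the example above.
line = json.dumps(make_record("000032.jpg", [
    {"category": "aeroplane", "bbox": [0.078, 0.09, 0.756, 0.792]},
]))
```

Writing one such line per image produces a file ready for the import command below.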

Run the import

pixano init ./my_library
pixano data import ./my_library ./voc_sample \
  --info data_importation/voc/info.py:dataset_info

Import options

Use --dry-run to validate your metadata without creating the dataset. Use --mode overwrite to re-import over an existing dataset.

Launch and explore

pixano server run ./my_library

Open http://127.0.0.1:7492 in your browser. You will see the Pixano library with a card for the VOC dataset. Click it to open the dataset explorer, where you can browse images, filter by split or category, and click any image to view it with its bounding box annotations.

Key Pixano features for object detection

Bounding box annotation

The core annotation tool for object detection. In the annotation workspace, you draw rectangles over objects in the image. Each bounding box is linked to an entity, which carries the object’s category and any custom fields you defined.

Because annotations and entities are stored separately, you can attach multiple annotations to the same entity. For example, you could have a bounding box and a segmentation mask for the same “car” — each stored in its own table, both linked to the same entity. This is exactly why we included CompressedRLE in the schema even though VOC only provides bounding boxes.

Interactive segmentation with SAM

The Segment Anything Model (SAM) lets you go beyond bounding boxes to pixel-level segmentation. Instead of manually drawing mask boundaries, you provide a few click prompts (foreground/background points or a bounding box) and SAM predicts a precise segmentation mask.

In Pixano, SAM integration works through the inference provider system. When connected to a Pixano Inference server running SAM, the annotation workspace offers a smart segmentation tool: click on an object, and Pixano sends the prompt to SAM, receives the mask, and stores it as a CompressedRLE annotation linked to the entity.

This is particularly powerful for object detection workflows: you can start with bounding boxes for quick labeling, then selectively refine objects with pixel-level masks using SAM — all within the same dataset, on the same entities.

Annotation provenance

Every annotation in Pixano carries built-in provenance fields:

  • source_type — one of model, human, ground_truth, or other
  • source_name — identifies the source (e.g. "yolo11n", "annotator_01")
  • source_metadata — a JSON string for arbitrary metadata (e.g. model version, confidence threshold)

This means you can import ground truth annotations from VOC, add YOLO predictions, and generate SAM masks — all in the same dataset — and always know which annotations came from where. In the UI, you can filter and compare annotations by source, which is essential for evaluating model performance against human labels.
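As a concrete (illustrative) example, the provenance of a YOLO-generated annotation could be described with a payload like this; the field names follow the list above, while the metadata contents are made up:

```python
import json

# Provenance for a model-generated annotation (values are illustrative).
# source_metadata is a JSON string, so arbitrary structured details fit.
provenance = {
    "source_type": "model",
    "source_name": "yolo11n",
    "source_metadata": json.dumps({"confidence_threshold": 0.25}),
}
```

A human label would instead carry source_type "human" and, for example, source_name "annotator_01", which is what makes per-source filtering and comparison possible in the UI.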

What you’ve learned

Let’s connect the dots:

  • Pascal VOC is an object detection benchmark where each image contains objects localized with bounding boxes and classified into 20 categories.
  • In Pixano, this maps to records (images), entities (objects with category and is_difficult), and annotations (bounding boxes and optionally masks).
  • The DatasetInfo schema defines which tables Pixano creates — adding CompressedRLE alongside BBox enables multi-type annotation and AI segmentation tools.
  • Provenance fields on annotations let you combine human labels, model predictions, and AI-assisted masks in a single dataset.

Next steps
