Skip to content

Creating a dataset

Defining a dataset

A dataset is a collection of tables from different SchemaGroups. To generate this collection, you provide a DatasetItem, which is essentially a BaseModel. The attributes of this DatasetItem must be features (see our key concepts page) and are organized as follows:

  • Schemas are stored in their respective tables.
  • Types are stored in the 'item' table.

Features can be a single element or a list.

Example:

from pixano.features import BBox, Classification, Entity, Image, Text
from pixano.datasets import DatasetItem


class MyEntity(Entity):
    category: str
    metadata_int: int


class MyDatasetItem(DatasetItem):
    item_metadata: str # will be stored in table 'item'
    image: Image # will be stored in table 'image'
    texts: list[Text] # will be stored in table 'texts'
    image_entities: list[MyEntity] # will be stored in table 'image_entities'
    bboxes: list[BBox] # will be stored in table 'bboxes'
    text_entities: list[Entity] # will be stored in table 'text_entities'
    text_classifs: list[Classification] # will be stored in table 'text_classifs'

Building a dataset

To build the dataset, Pixano provides a generic class DatasetBuilder and specific classes detailed in the api reference.

To construct a dataset builder, derive from DatasetBuilder to properly handle your data and implement the generate_data method. This method returns an iterator of dictionnary whose keys are table_names and values one BaseSchema or a list of BaseSchema to insert into that table.

from pixano.datasets.builders import DatasetBuilder


info = DatasetInfo(
    name="My super dataset",
    description="This dataset tracks reference to super stuff."
)

class MyDatasetBuilder(DatasetBuilder):
    def __init__(
        target_dir: Path | str,
    ):
        super().__init__(
            target_dir=target_dir,
            schemas=MyDatasetItem,
            info=info
        )

    def generate_data() -> Iterator[dict[str, BaseSchema | list[BaseSchema]]]:
        ...

        return {
            "item": ...,
            "bboxes": ...,
            ...
        }

For more detailed information, check our tutorial on how to build and query a dataset.