# Quickstart: Deploy Built-in Models
This guide walks you through deploying built-in models and running your first inference requests.
## Prerequisites
- Python 3.10+
- A GPU is recommended for production use but not required for testing
## Installation
Install Pixano-Inference with the model-specific extras you need:
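For example (the extra names below mirror the "Extra required" column of the built-in models table later in this guide; verify the exact package and extra names for your install):

```shell
# Core package plus the SAM2 extra (extra name assumed from the
# built-in models table in this guide).
pip install "pixano-inference[sam2]"

# Multiple extras can be combined:
pip install "pixano-inference[sam2,transformers]"
```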
## Write a Python config

Create a file called `models.py` that declares which models to deploy.
Here is a minimal example deploying SAM2 for image segmentation:
```python
from pixano_inference.configs import DeploymentConfig, ModelConfig, Sam2ImageParams

models = [
    ModelConfig(
        name="sam2-image",
        model_class="Sam2ImageModel",
        model_params=Sam2ImageParams(
            path="facebook/sam2-hiera-base-plus",
            torch_dtype="bfloat16",
        ),
        deployment=DeploymentConfig(
            num_gpus=1,
            min_replicas=0,
            max_replicas=2,
            max_batch_size=8,
        ),
    )
]
```
### CPU-only testing

Set `num_gpus=0` and `torch_dtype="float32"` to run on CPU.
This is useful for testing but not recommended for production.
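Applied to the example above, the relevant fields look like this (a fragment of the same `models.py`, using the imports shown earlier; the model name is illustrative):

```python
# CPU-only variant of the earlier SAM2 config (same imports as above).
ModelConfig(
    name="sam2-image-cpu",
    model_class="Sam2ImageModel",
    model_params=Sam2ImageParams(
        path="facebook/sam2-hiera-base-plus",
        torch_dtype="float32",  # full precision; safest choice on CPU
    ),
    deployment=DeploymentConfig(
        num_gpus=0,  # schedule replicas on CPU only
        max_batch_size=1,
    ),
)
```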
## Start the server

### From the CLI
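The exact CLI entry point is not shown in this guide; a typical invocation would look something like the following (the command name and flags are assumptions — check `--help` for your installed version):

```shell
# Hypothetical CLI invocation; verify the actual command and flags
# with `pixano-inference --help`.
pixano-inference serve --config models.py --port 7463
```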
### Programmatically
```python
from pixano_inference.ray import InferenceServer

server = InferenceServer()
server.register_from_config("models.py")
server.start(blocking=True)
```
## Verify the deployment
Check the health endpoint:
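Assuming the server listens on port 7463 as in the inference examples below, and a conventional `/health` route (the exact path is an assumption):

```shell
# The /health path is assumed; adjust if your deployment differs.
curl http://localhost:7463/health
```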
List the deployed models and confirm that yours appears in the response:
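A conventional listing endpoint would be queried like this (the `/models` path and the response shape are assumptions, not taken from a documented API):

```shell
# The /models path and the response shape are assumptions.
curl http://localhost:7463/models
# A successful response should mention the model registered above,
# e.g. a JSON list containing an entry named "sam2-image".
```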
## Send inference requests

### With curl
```shell
# Encode an image to base64 (macOS syntax; on Linux with GNU coreutils,
# use: IMAGE_B64=$(base64 -w0 your_image.png) to avoid line wrapping)
IMAGE_B64=$(base64 -i your_image.png)

curl -X POST http://localhost:7463/inference/segmentation/ \
  -H "Content-Type: application/json" \
  -d '{
    "model": "sam2-image",
    "image": "data:image/png;base64,'"$IMAGE_B64"'",
    "points": [[[200, 175]]],
    "labels": [[1]]
  }'
```
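If you prefer to build the request body in Python, the base64 step from the curl example can be sketched as follows (`to_data_url` is a hypothetical helper, not part of the client library):

```python
import base64


def to_data_url(image_bytes: bytes, mime: str = "image/png") -> str:
    """Encode raw image bytes as the data URL expected in the "image" field."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"


# Usage with a file on disk:
# with open("your_image.png", "rb") as f:
#     image_field = to_data_url(f.read())
```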
### With the Python client
```python
import asyncio

from pixano_inference.client import PixanoInferenceClient
from pixano_inference.schemas import SegmentationRequest


async def main():
    client = PixanoInferenceClient.connect("http://localhost:7463")

    request = SegmentationRequest(
        model="sam2-image",
        image="data:image/png;base64,...",  # Base64-encoded image
        points=[[[200, 175]]],  # Point prompt (x, y)
        labels=[[1]],  # 1 = foreground
    )
    response = await client.segmentation(request)

    print(f"Status: {response.status}")
    print(f"Processing time: {response.processing_time:.3f}s")
    print(f"Masks: {len(response.data.masks[0])}")
    print(f"Scores: {response.data.scores.to_numpy()}")

    # Decode a mask to a numpy array
    mask = response.data.masks[0][0].to_mask()
    print(f"Mask shape: {mask.shape}")


asyncio.run(main())
```
## Built-in models
The following model classes are available out of the box:
| Model class | Capability | Extra required | Example `model_params` |
|---|---|---|---|
| `Sam2ImageModel` | segmentation | `sam2` | `path: facebook/sam2-hiera-base-plus` |
| `Sam2VideoModel` | tracking | `sam2` | `path: facebook/sam2-hiera-large` |
| `GroundingDINOModel` | detection | `transformers` | `path: IDEA-Research/grounding-dino-base` |
| `TransformersVLMModel` | vlm | `transformers` | `path: llava-hf/llava-1.5-7b-hf` |
| `VLLMVLMModel` | vlm | `vllm` | `path: Qwen/Qwen2-VL-7B-Instruct` |
## Multi-model config
You can deploy multiple models in a single config file. Each model gets its own Ray actor with dedicated resources:
```python
from pixano_inference.configs import (
    DeploymentConfig,
    GroundingDINOParams,
    ModelConfig,
    Sam2ImageParams,
    Sam2VideoParams,
)

models = [
    ModelConfig(
        name="sam2-image",
        model_class="Sam2ImageModel",
        model_params=Sam2ImageParams(path="facebook/sam2-hiera-base-plus"),
        deployment=DeploymentConfig(num_gpus=1, max_batch_size=8),
    ),
    ModelConfig(
        name="sam2-video",
        model_class="Sam2VideoModel",
        model_params=Sam2VideoParams(path="facebook/sam2-hiera-large"),
        deployment=DeploymentConfig(num_gpus=1, max_batch_size=1),
    ),
    ModelConfig(
        name="grounding-dino",
        model_class="GroundingDINOModel",
        model_params=GroundingDINOParams(path="IDEA-Research/grounding-dino-base"),
        deployment=DeploymentConfig(num_gpus=1, max_batch_size=8),
    ),
]
```
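Since each deployment autoscales independently, it is worth checking worst-case GPU demand before deploying. A quick back-of-the-envelope sketch in plain Python, assuming the default `max_replicas=4` from the deployment reference section of this guide (the config above does not set it):

```python
# Peak GPU demand for the three-model config above: each model can scale
# up to max_replicas (default 4 when unset), each replica holding num_gpus.
models = [
    {"name": "sam2-image", "num_gpus": 1, "max_replicas": 4},
    {"name": "sam2-video", "num_gpus": 1, "max_replicas": 4},
    {"name": "grounding-dino", "num_gpus": 1, "max_replicas": 4},
]
peak_gpus = sum(m["num_gpus"] * m["max_replicas"] for m in models)
print(peak_gpus)  # 12 GPUs if every model scales out fully
```

Setting `max_replicas` explicitly per model keeps this bound within your cluster's capacity.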
> **Tip:** See `custom_models.md` for external model modules via
> `model_module`, and `deploy/sam2_example.py` for a typed config example.
## Deployment configuration reference

Each `ModelConfig(...)` entry supports the following fields:
### Model fields
| Field | Type | Default | Description |
|---|---|---|---|
| `name` | `str` | required | Unique model name |
| `model_class` | `str \| type` | required | Registered model class name or class object |
| `model_module` | `str` | `None` | Python module to import before resolving `model_class` (e.g. `my_package.models`). Used for external custom models. |
| `model_params` | `dict \| BaseModelParams` | `{}` | Parameters passed to the model (e.g. `path`, `torch_dtype`) |
Capability is derived automatically from `model_class`. For example:
`SegmentationModel` subclasses deploy behind `/inference/segmentation/`, while
`DetectionModel` subclasses deploy behind `/inference/detection/`.
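As an illustration of the base-class-to-route mapping described above (a sketch only, not the library's implementation; `TrackingModel` and `VLMModel` are assumed base-class names):

```python
# Illustrative mapping from model base class to capability and route.
# SegmentationModel and DetectionModel are named in this guide;
# TrackingModel and VLMModel are assumed for the other two capabilities.
CAPABILITY_BY_BASE = {
    "SegmentationModel": "segmentation",
    "TrackingModel": "tracking",
    "DetectionModel": "detection",
    "VLMModel": "vlm",
}


def route_for(base_class_name: str) -> str:
    """Return the inference route derived from a model's base class."""
    capability = CAPABILITY_BY_BASE[base_class_name]
    return f"/inference/{capability}/"
```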
### Deployment fields (the `deployment` argument)
| Field | Type | Default | Description |
|---|---|---|---|
| `num_gpus` | `float` | `0.0` | GPUs per replica |
| `num_cpus` | `float` | `1.0` | CPUs per replica |
| `memory_mb` | `int` | `None` | Memory limit in MB (`None` = no limit) |
| `min_replicas` | `int` | `0` | Minimum replicas (0 = scale to zero) |
| `max_replicas` | `int` | `4` | Maximum replicas |
| `target_num_ongoing_requests_per_replica` | `int` | `2` | Autoscaling target |
| `downscale_delay_s` | `float` | `60.0` | Seconds to wait before scaling down |
| `upscale_delay_s` | `float` | `5.0` | Seconds to wait before scaling up |
| `max_batch_size` | `int` | `8` | Maximum batch size for inference |
| `batch_wait_timeout_s` | `float` | `0.1` | Timeout for filling a batch (seconds) |
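As a back-of-the-envelope check, the replica count the autoscaler aims for can be estimated from `target_num_ongoing_requests_per_replica` (a simplified sketch of target-based autoscaling, not Ray Serve's exact algorithm):

```python
import math


def desired_replicas(ongoing_requests: int,
                     target_per_replica: int = 2,
                     min_replicas: int = 0,
                     max_replicas: int = 4) -> int:
    """Estimate replicas: total load divided by the per-replica target,
    clamped to the configured [min_replicas, max_replicas] bounds."""
    raw = math.ceil(ongoing_requests / target_per_replica)
    return max(min_replicas, min(max_replicas, raw))
```

With the defaults above, 5 concurrent requests suggest `ceil(5 / 2) = 3` replicas, zero load scales to zero, and heavy load is capped at `max_replicas`.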
### Available capabilities
| Capability | Description |
|---|---|
| `segmentation` | Generate masks from images (SAM2) |
| `tracking` | Track prompted objects across video frames (SAM2) |
| `detection` | Detect objects and optionally return masks when the model supports it |
| `vlm` | Generate text from images and prompts (LLaVA, Qwen2-VL, etc.) |
## Next steps
To deploy your own custom models (PyTorch, JAX, TensorFlow, or any framework), see the Custom Models Guide.