# Quickstart: Deploy Built-in Models
This guide walks you through deploying built-in models and running your first inference requests.
## Prerequisites
- Python 3.10+
- A GPU is recommended for production use but not required for testing
## Installation
Install Pixano-Inference with the model-specific extras you need:
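For example (the extra names below mirror the "Extra required" column of the built-in models table later in this guide; verify the exact package and extra names for your install):

```shell
# Core package plus the SAM2 extra (extra name assumed from the
# built-in models table in this guide).
pip install "pixano-inference[sam2]"

# Multiple extras can be combined:
pip install "pixano-inference[sam2,transformers]"
```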
## Write a Python config

Create a file called `models.py` that declares which models to deploy.
Here is a minimal example deploying SAM2 for image segmentation:
```python
from pixano_inference.configs import DeploymentConfig, ModelConfig, Sam2ImageParams

models = [
    ModelConfig(
        name="sam2-image",
        model_class="Sam2ImageModel",
        model_params=Sam2ImageParams(
            path="facebook/sam2-hiera-base-plus",
            torch_dtype="bfloat16",
        ),
        deployment=DeploymentConfig(
            num_gpus=1,
            min_replicas=0,
            max_replicas=2,
            max_batch_size=8,
        ),
    )
]
```
### CPU-only testing

Set `num_gpus=0` and `torch_dtype="float32"` to run on CPU.
This is useful for testing but not recommended for production.
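Applied to the example above, the relevant fields look like this (a fragment of the same `models.py`, using the imports shown earlier; the model name is illustrative):

```python
# CPU-only variant of the earlier SAM2 config (same imports as above).
ModelConfig(
    name="sam2-image-cpu",
    model_class="Sam2ImageModel",
    model_params=Sam2ImageParams(
        path="facebook/sam2-hiera-base-plus",
        torch_dtype="float32",  # full precision; safest choice on CPU
    ),
    deployment=DeploymentConfig(
        num_gpus=0,  # schedule replicas on CPU only
        max_batch_size=1,
    ),
)
```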
## Start the server

### From the CLI
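The exact CLI entry point is not shown in this guide; a typical invocation would look something like the following (the command name and flags are assumptions — check `--help` for your installed version):

```shell
# Hypothetical CLI invocation; verify the actual command and flags
# with `pixano-inference --help`.
pixano-inference serve --config models.py --port 7463
```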
### Programmatically
```python
from pixano_inference.ray import InferenceServer

server = InferenceServer()
server.register_from_config("models.py")
server.start(blocking=True)
```
## Verify the deployment
Check the health endpoint:
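Assuming the server listens on port 7463 as in the inference examples below, and a conventional `/health` route (the exact path is an assumption):

```shell
# The /health path is assumed; adjust if your deployment differs.
curl http://localhost:7463/health
```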
List the deployed models and confirm that yours appears in the response:
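A conventional listing endpoint would be queried like this (the `/models` path and the response shape are assumptions, not taken from a documented API):

```shell
# The /models path and the response shape are assumptions.
curl http://localhost:7463/models
# A successful response should mention the model registered above,
# e.g. a JSON list containing an entry named "sam2-image".
```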
## Send inference requests

### With curl
```shell
# Encode an image to base64 (macOS syntax; on Linux with GNU coreutils,
# use: IMAGE_B64=$(base64 -w0 your_image.png) to avoid line wrapping)
IMAGE_B64=$(base64 -i your_image.png)

curl -X POST http://localhost:7463/inference/segmentation/ \
  -H "Content-Type: application/json" \
  -d '{
    "model": "sam2-image",
    "image": "data:image/png;base64,'"$IMAGE_B64"'",
    "points": [[[200, 175]]],
    "labels": [[1]]
  }'
```
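If you prefer to build the request body in Python, the base64 step from the curl example can be sketched as follows (`to_data_url` is a hypothetical helper, not part of the client library):

```python
import base64


def to_data_url(image_bytes: bytes, mime: str = "image/png") -> str:
    """Encode raw image bytes as the data URL expected in the "image" field."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"


# Usage with a file on disk:
# with open("your_image.png", "rb") as f:
#     image_field = to_data_url(f.read())
```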
### With the Python client
```python
import asyncio

from pixano_inference.client import PixanoInferenceClient
from pixano_inference.schemas import SegmentationRequest


async def main():
    client = PixanoInferenceClient.connect("http://localhost:7463")

    request = SegmentationRequest(
        model="sam2-image",
        image="data:image/png;base64,...",  # Base64-encoded image
        points=[[[200, 175]]],  # Point prompt (x, y)
        labels=[[1]],  # 1 = foreground
    )
    response = await client.segmentation(request)

    print(f"Status: {response.status}")
    print(f"Processing time: {response.processing_time:.3f}s")
    print(f"Masks: {len(response.data.masks[0])}")
    print(f"Scores: {response.data.scores.to_numpy()}")

    # Decode a mask to a numpy array
    mask = response.data.masks[0][0].to_mask()
    print(f"Mask shape: {mask.shape}")


asyncio.run(main())
```
## Built-in models
The following model classes are available out of the box:
| Model class | Capability | Extra required | Example `model_params` |
|---|---|---|---|
| `Sam2ImageModel` | segmentation | `sam2` | `path: facebook/sam2-hiera-base-plus` |
| `Sam2VideoModel` | tracking | `sam2` | `path: facebook/sam2-hiera-large` |
| `GroundingDINOModel` | detection | `transformers` | `path: IDEA-Research/grounding-dino-base` |
| `TransformersVLMModel` | vlm | `transformers` | `path: llava-hf/llava-1.5-7b-hf` |
| `VLLMVLMModel` | vlm | `vllm` | `path: Qwen/Qwen2-VL-7B-Instruct` |
## Multi-model config
You can deploy multiple models in a single config file. Each model gets its own Ray actor with dedicated resources:
```python
from pixano_inference.configs import (
    DeploymentConfig,
    GroundingDINOParams,
    ModelConfig,
    Sam2ImageParams,
    Sam2VideoParams,
)

models = [
    ModelConfig(
        name="sam2-image",
        model_class="Sam2ImageModel",
        model_params=Sam2ImageParams(path="facebook/sam2-hiera-base-plus"),
        deployment=DeploymentConfig(num_gpus=1, max_batch_size=8),
    ),
    ModelConfig(
        name="sam2-video",
        model_class="Sam2VideoModel",
        model_params=Sam2VideoParams(path="facebook/sam2-hiera-large"),
        deployment=DeploymentConfig(num_gpus=1, max_batch_size=1),
    ),
    ModelConfig(
        name="grounding-dino",
        model_class="GroundingDINOModel",
        model_params=GroundingDINOParams(path="IDEA-Research/grounding-dino-base"),
        deployment=DeploymentConfig(num_gpus=1, max_batch_size=8),
    ),
]
```
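Since each deployment autoscales independently, it is worth checking worst-case GPU demand before deploying. A quick back-of-the-envelope sketch in plain Python, assuming the default `max_replicas=4` from the deployment reference section of this guide (the config above does not set it):

```python
# Peak GPU demand for the three-model config above: each model can scale
# up to max_replicas (default 4 when unset), each replica holding num_gpus.
models = [
    {"name": "sam2-image", "num_gpus": 1, "max_replicas": 4},
    {"name": "sam2-video", "num_gpus": 1, "max_replicas": 4},
    {"name": "grounding-dino", "num_gpus": 1, "max_replicas": 4},
]
peak_gpus = sum(m["num_gpus"] * m["max_replicas"] for m in models)
print(peak_gpus)  # 12 GPUs if every model scales out fully
```

Setting `max_replicas` explicitly per model keeps this bound within your cluster's capacity.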
> **Tip:** See `custom_models.md` for external model modules via
> `model_module`, and `deploy/sam2_example.py` for a typed config example.
## Deployment configuration reference

Each `ModelConfig(...)` entry supports the following fields:
### Model fields
| Field | Type | Default | Description |
|---|---|---|---|
| `name` | `str` | required | Unique model name |
| `model_class` | `str \| type` | required | Registered model class name or class object |
| `model_module` | `str` | `None` | Python module to import before resolving `model_class` (e.g. `my_package.models`). Used for external custom models. |
| `model_params` | `dict \| BaseModelParams` | `{}` | Parameters passed to the model (e.g. `path`, `torch_dtype`) |
Capability is derived automatically from `model_class`. For example:
`SegmentationModel` subclasses deploy behind `/inference/segmentation/`, while
`DetectionModel` subclasses deploy behind `/inference/detection/`.
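As an illustration of the base-class-to-route mapping described above (a sketch only, not the library's implementation; `TrackingModel` and `VLMModel` are assumed base-class names):

```python
# Illustrative mapping from model base class to capability and route.
# SegmentationModel and DetectionModel are named in this guide;
# TrackingModel and VLMModel are assumed for the other two capabilities.
CAPABILITY_BY_BASE = {
    "SegmentationModel": "segmentation",
    "TrackingModel": "tracking",
    "DetectionModel": "detection",
    "VLMModel": "vlm",
}


def route_for(base_class_name: str) -> str:
    """Return the inference route derived from a model's base class."""
    capability = CAPABILITY_BY_BASE[base_class_name]
    return f"/inference/{capability}/"
```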
### Deployment fields (the `deployment` argument)
| Field | Type | Default | Description |
|---|---|---|---|
| `num_gpus` | `float` | `0.0` | GPUs per replica |
| `num_cpus` | `float` | `1.0` | CPUs per replica |
| `memory_mb` | `int` | `None` | Memory limit in MB (`None` = no limit) |
| `min_replicas` | `int` | `0` | Minimum replicas (0 = scale to zero) |
| `max_replicas` | `int` | `4` | Maximum replicas |
| `target_num_ongoing_requests_per_replica` | `int` | `2` | Autoscaling target |
| `downscale_delay_s` | `float` | `60.0` | Seconds to wait before scaling down |
| `upscale_delay_s` | `float` | `5.0` | Seconds to wait before scaling up |
| `max_batch_size` | `int` | `8` | Maximum batch size for inference |
| `batch_wait_timeout_s` | `float` | `0.1` | Timeout for filling a batch (seconds) |
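As a back-of-the-envelope check, the replica count the autoscaler aims for can be estimated from `target_num_ongoing_requests_per_replica` (a simplified sketch of target-based autoscaling, not Ray Serve's exact algorithm):

```python
import math


def desired_replicas(ongoing_requests: int,
                     target_per_replica: int = 2,
                     min_replicas: int = 0,
                     max_replicas: int = 4) -> int:
    """Estimate replicas: total load divided by the per-replica target,
    clamped to the configured [min_replicas, max_replicas] bounds."""
    raw = math.ceil(ongoing_requests / target_per_replica)
    return max(min_replicas, min(max_replicas, raw))
```

With the defaults above, 5 concurrent requests suggest `ceil(5 / 2) = 3` replicas, zero load scales to zero, and heavy load is capped at `max_replicas`.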
### Available capabilities
| Capability | Description |
|---|---|
| `segmentation` | Generate masks from images (SAM2) |
| `tracking` | Track prompted objects across video frames (SAM2) |
| `detection` | Detect objects and optionally return masks when the model supports it |
| `vlm` | Generate text from images and prompts (LLaVA, Qwen2-VL, etc.) |
## Next steps
To deploy your own custom models (PyTorch, JAX, TensorFlow, or any framework), see the Custom Models Guide.