# Server Deployment
Pixano-Inference uses Ray Serve to deploy models as GPU-aware actors with built-in autoscaling, request batching, and multi-model serving. It is framework-agnostic: you can deploy PyTorch, JAX, TensorFlow, or any other Python-based model.
## Key features
- GPU management -- Per-model CPU/GPU resources in Python config files
- Autoscaling -- Scale-to-zero when idle, scale-up under load, per model
- Request batching -- Configurable batch size and wait timeout per deployment
- Multi-model serving -- Deploy multiple models in a single server, each with dedicated resources
- Custom models -- Bring any model by subclassing a base class
- Python configuration -- Typed `ModelConfig` declarations in `.py` files
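As a rough illustration of what such a typed config might look like, here is a self-contained sketch using a plain dataclass. The field names (`num_gpus`, `batch_max_size`, and so on) are assumptions for illustration; the actual `ModelConfig` fields in Pixano-Inference may differ.

```python
from dataclasses import dataclass

# Hypothetical sketch of a typed model config; the real ModelConfig
# in Pixano-Inference may expose different field names.
@dataclass
class ModelConfig:
    name: str                       # model identifier used in API routes
    task: str                       # e.g. "segmentation", "detection", "vlm"
    num_gpus: float = 0.0           # fractional GPUs are allowed by Ray
    num_cpus: float = 1.0
    batch_max_size: int = 1         # request batching: max batch size
    batch_wait_timeout_s: float = 0.0  # request batching: wait timeout

# A models.py file would then declare one entry per deployed model.
MODELS = [
    ModelConfig(name="sam2-image", task="segmentation", num_gpus=1.0,
                batch_max_size=8, batch_wait_timeout_s=0.05),
    ModelConfig(name="grounding-dino", task="detection", num_gpus=1.0),
]
```

Keeping the configuration in plain Python (rather than YAML) means declarations are type-checked and can be composed or generated programmatically.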
## Architecture
```
Python config file (`models.py`)
        │
        ▼
  InferenceServer
        │
        ├── Ray init (GPU/CPU resources)
        │
        ├── DeploymentManager
        │     │
        │     ├── ModelActor (sam2-image)      ← Ray actor, owns GPU
        │     ├── ModelActor (grounding-dino)  ← Ray actor, owns GPU
        │     └── ModelActor (llava)           ← Ray actor, owns GPU
        │
        └── FastAPI app (uvicorn)
              │
              ├── /health
              ├── /app/models/
              ├── /inference/segmentation/
              ├── /inference/detection/
              ├── /inference/tracking/
              └── /inference/vlm/
```
The `InferenceServer` reads a Python config file, initializes Ray, and deploys
each model as a separate Ray actor. The FastAPI application runs in-process via
uvicorn and routes inference requests to the appropriate actor.
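In Ray Serve terms, a per-model actor with scale-to-zero autoscaling and request batching can be sketched as below. This is a minimal illustration of the Ray Serve primitives involved, not Pixano-Inference's actual implementation; the class and method names are hypothetical, and it assumes Ray Serve is installed and a cluster is running.

```python
from ray import serve

@serve.deployment(
    # Per-deployment resources: this replica owns one GPU.
    ray_actor_options={"num_gpus": 1},
    # Scale-to-zero when idle, scale up under load.
    autoscaling_config={"min_replicas": 0, "max_replicas": 2},
)
class Sam2Image:
    def __init__(self):
        # Load model weights onto the GPU once per replica.
        self.model = load_sam2()  # hypothetical loader

    # Batch incoming requests: flush at 8 requests or after 50 ms.
    @serve.batch(max_batch_size=8, batch_wait_timeout_s=0.05)
    async def __call__(self, requests: list) -> list:
        return [self.model.predict(r) for r in requests]
```

Each model in the config file maps to one such deployment, which is why resources, autoscaling, and batching can all be tuned per model.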
## Next steps
- Quickstart -- Deploy built-in models in 5 minutes.
- Custom Models Guide -- Write and deploy your own models.