HTTP API Reference
This reference documents the inference HTTP surface exposed by standalone sie-server and Kubernetes sie-gateway. Component-specific status endpoints are marked below. Cluster-only config and pool endpoints are covered in Config API and Gateway.
Endpoint Summary
| Endpoint | Method | Available on | Purpose |
|---|---|---|---|
| /v1/encode/:model | POST | sie-server, sie-gateway | Generate embeddings |
| /v1/score/:model | POST | sie-server, sie-gateway | Rerank items |
| /v1/extract/:model | POST | sie-server, sie-gateway | Extract entities and structured data |
| /v1/models | GET | sie-server, sie-gateway | List available models |
| /v1/models/:model | GET | sie-server, sie-gateway | Get model details |
| /v1/embeddings | POST | sie-server, sie-gateway | OpenAI-compatible embeddings |
| /healthz | GET | All runtime components | Liveness probe |
| /readyz | GET | All runtime components | Readiness probe |
| /metrics | GET | All runtime components | Prometheus metrics |
| /ws/status | WebSocket | sie-server | Real-time Python sie-server status |
| /ws/cluster-status | WebSocket | sie-gateway | Cluster status stream |
In Kubernetes, encode, score, extract, and embeddings requests hit the Rust gateway first. The gateway publishes msgpack work items to NATS JetStream. The SIE server sidecar inside the worker pod then pulls the work items, batches them, calls the sie-server adapter over IPC, and publishes the result back to the gateway.
Wire Format
SIE defaults to msgpack for efficient binary serialization. This preserves numpy arrays natively and produces ~37% smaller payloads than JSON.

Content negotiation:

- Content-Type: application/msgpack for requests
- Accept: application/msgpack for responses (default)
- Accept: application/json returns JSON
When using JSON, arrays are converted to lists.
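The negotiation headers above can be sketched with the Python standard library. This is a minimal illustration, not a client shipped with SIE: the helper name `build_encode_request` is hypothetical, the request body is JSON as in the curl examples in this reference, and the request is only constructed, never sent. Whether every Content-Type/Accept pairing is accepted is not stated here, so treat a mixed pairing as an assumption.

```python
import json
import urllib.request


def build_encode_request(base_url, model, payload, accept="application/json"):
    """Build (without sending) a POST to /v1/encode/:model with a JSON body.

    `accept` selects the response wire format; only the two media types
    named in this section are allowed.
    """
    if accept not in ("application/json", "application/msgpack"):
        raise ValueError(f"unsupported Accept type: {accept}")
    return urllib.request.Request(
        url=f"{base_url}/v1/encode/{model}",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json", "Accept": accept},
    )


request = build_encode_request(
    "http://localhost:8080",
    "BAAI/bge-m3",
    {"items": [{"text": "Hello, world!"}]},
)
print(request.get_method(), request.full_url)
print(request.header_items())
```

Passing the result to `urllib.request.urlopen` would send it. A msgpack request body would additionally need a msgpack encoder, which is outside the standard library.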
POST /v1/encode/:model
Generate embeddings for input items. Supports dense, sparse, and multi-vector outputs.
Request Schema
    from __future__ import annotations  # lets the schemas reference types defined further down

    from typing import Any, TypedDict

    class EncodeRequest(TypedDict, total=False):
        items: list[Item]            # Required: items to encode
        params: EncodeParams         # Optional: encoding parameters

    class EncodeParams(TypedDict, total=False):
        output_types: list[str]      # 'dense', 'sparse', 'multivector'
        instruction: str             # Task instruction for query encoding
        output_dtype: str            # 'float32', 'float16', 'int8', 'binary'
        options: dict[str, Any]      # Profile, LoRA, runtime options

    class Item(TypedDict, total=False):
        id: str                      # Client-provided ID (echoed back)
        text: str                    # Text content
        images: list[ImageInput]     # Image bytes with format hint

    class ImageInput(TypedDict, total=False):
        data: bytes                  # Image bytes
        format: str                  # 'jpeg', 'png', 'webp'

Response Schema
    class EncodeResponse(TypedDict, total=False):
        model: str                   # Model name used
        items: list[EncodeResult]    # One result per input item
        timing: TimingInfo           # Server-side timing breakdown

    class EncodeResult(TypedDict, total=False):
        id: str                      # Echoed item ID
        dense: DenseVector           # Dense embedding
        sparse: SparseVector         # Sparse embedding
        multivector: MultiVector     # Per-token embeddings

    class DenseVector(TypedDict, total=False):
        dims: int                    # Vector dimensionality
        dtype: str                   # 'float32', 'float16', 'int8', 'binary'
        values: list[float]          # Vector values

    class SparseVector(TypedDict, total=False):
        dims: int                    # Vocabulary size
        dtype: str                   # Data type
        indices: list[int]           # Non-zero dimension indices
        values: list[float]          # Values at those indices

    class MultiVector(TypedDict, total=False):
        token_dims: int              # Per-token embedding dimension
        num_tokens: int              # Number of tokens
        dtype: str                   # Data type
        values: list[list[float]]    # Token embeddings

Request Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| items | list[Item] | Required | Items to encode |
| params.output_types | list[str] | ["dense"] | Output types to return |
| params.instruction | str | None | Instruction prefix for query encoding |
| params.output_dtype | str | "float32" | Output precision |
| params.options | dict | None | Runtime options (profile, lora, etc.) |
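As a rough sketch of how the parameters in this table combine into a request body, the helper below assembles an encode payload and leaves out any parameter the caller did not set, so the server-side defaults listed above apply. The function name `make_encode_payload` and the `doc-N` item IDs are hypothetical choices for illustration.

```python
import json
from typing import Any, Optional


def make_encode_payload(
    texts: list,
    output_types: Optional[list] = None,
    instruction: Optional[str] = None,
    output_dtype: Optional[str] = None,
) -> dict:
    """Assemble an encode request body, omitting params that were not set."""
    payload: dict = {
        # Client-provided IDs; the schema says they are echoed back.
        "items": [{"id": f"doc-{i}", "text": t} for i, t in enumerate(texts)]
    }
    params: dict = {}
    if output_types is not None:
        params["output_types"] = output_types
    if instruction is not None:
        params["instruction"] = instruction
    if output_dtype is not None:
        params["output_dtype"] = output_dtype
    if params:
        payload["params"] = params
    return payload


payload = make_encode_payload(
    ["Search query"],
    output_types=["dense", "sparse"],
    instruction="Represent this query for retrieval:",
)
print(json.dumps(payload, indent=2))
```

Omitting unset keys keeps the body minimal and avoids sending an explicit null where the table documents a default.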
Examples
Basic encoding:

    curl -X POST http://localhost:8080/v1/encode/BAAI/bge-m3 \
      -H "Content-Type: application/json" \
      -H "Accept: application/json" \
      -d '{
        "items": [{"text": "Hello, world!"}]
      }'

Multiple output types:

    curl -X POST http://localhost:8080/v1/encode/BAAI/bge-m3 \
      -H "Content-Type: application/json" \
      -H "Accept: application/json" \
      -d '{
        "items": [{"text": "Search query"}],
        "params": {
          "output_types": ["dense", "sparse"],
          "instruction": "Represent this query for retrieval:"
        }
      }'

Response:

    {
      "model": "BAAI/bge-m3",
      "items": [
        {
          "dense": {
            "dims": 1024,
            "dtype": "float32",
            "values": [0.0234, -0.0891, 0.1234, ...]
          },
          "sparse": {
            "dims": 250002,
            "dtype": "float32",
            "indices": [101, 2023, 5789, ...],
            "values": [0.45, 0.32, 0.28, ...]
          }
        }
      ]
    }

POST /v1/score/:model
Rerank items against a query using a cross-encoder model.
Request Schema
    class ScoreRequest(TypedDict, total=False):
        query: Item                  # Required: query to score against
        items: list[Item]            # Required: items to score
        instruction: str             # Optional instruction
        options: dict[str, Any]      # Runtime options

Response Schema

    class ScoreResponse(TypedDict, total=False):
        model: str
        query_id: str | None         # Echoed query ID
        scores: list[ScoreEntry]     # Sorted by score descending

    class ScoreEntry(TypedDict):
        item_id: str | None          # Echoed item ID
        score: float                 # Relevance score
        rank: int                    # Position (0 = most relevant)

Example

    curl -X POST http://localhost:8080/v1/score/BAAI/bge-reranker-v2-m3 \
      -H "Content-Type: application/json" \
      -H "Accept: application/json" \
      -d '{
        "query": {"text": "What is machine learning?"},
        "items": [
          {"id": "doc-1", "text": "ML uses algorithms to learn from data."},
          {"id": "doc-2", "text": "The weather is sunny today."}
        ]
      }'

Response:

    {
      "model": "BAAI/bge-reranker-v2-m3",
      "scores": [
        {"item_id": "doc-1", "score": 0.891, "rank": 0},
        {"item_id": "doc-2", "score": 0.023, "rank": 1}
      ]
    }

POST /v1/extract/:model
Extract structured data from items: entities, relations, classifications, or vision outputs.
Request Schema
    class ExtractRequest(TypedDict, total=False):
        items: list[Item]            # Required: items to extract from
        params: ExtractParams        # Optional: extraction parameters

    class ExtractParams(TypedDict, total=False):
        labels: list[str]            # Entity types for NER
        output_schema: dict          # JSON schema for structured extraction
        instruction: str             # Task instruction
        options: dict[str, Any]      # Runtime options (see below)

Per-request options

params.options is an adapter-specific dict. Currently supported keys:
| Key | Type | Default | Scope | Description |
|---|---|---|---|---|
| overflow_policy | "default" \| "truncate_text" \| "error" | "default" | gliclass-* family | Controls behavior when text + label_prompt exceeds the model’s context (512 tokens for gliclass-{small,base,large}-v1.0). default passes input through as-is (may surface as INPUT_TOO_LONG on these models). truncate_text truncates the end of text to fit while preserving labels. error always raises INPUT_TOO_LONG on overflow. |
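The three policies in this table can be modeled as a small decision function. The sketch below only illustrates the documented behavior in terms of token counts; it is not the server's implementation, and the names `fit_text_tokens` and `InputTooLong` are hypothetical. The numbers in the usage lines (540 text tokens, 72 label-prompt tokens, cap 512) are borrowed from the INPUT_TOO_LONG example message in this reference.

```python
class InputTooLong(Exception):
    """Stand-in for the INPUT_TOO_LONG error described in this reference."""


def fit_text_tokens(text_tokens: int, label_tokens: int, cap: int,
                    policy: str = "default") -> int:
    """Return how many text tokens would reach the model under `policy`."""
    if policy not in ("default", "truncate_text", "error"):
        raise ValueError(f"unknown overflow_policy: {policy!r}")
    total = text_tokens + label_tokens
    if total <= cap:
        return text_tokens  # fits: every policy behaves the same
    if policy == "truncate_text":
        # Trim the end of the text; the label prompt is kept intact.
        return max(cap - label_tokens, 0)
    if policy == "error":
        raise InputTooLong(f"observed {total} tokens exceeds context cap {cap}")
    # "default": input is passed through unchanged, so the model itself
    # may still reject it downstream.
    return text_tokens


print(fit_text_tokens(540, 72, 512, "truncate_text"))  # 440
print(fit_text_tokens(540, 72, 512, "default"))        # 540
```

With `truncate_text`, the label prompt always keeps its full 72 tokens and the text absorbs the whole reduction, which matches the "preserving labels" wording in the table.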
Response Schema
    class ExtractResponse(TypedDict, total=False):
        model: str
        items: list[ExtractResult]

    class ExtractResult(TypedDict, total=False):
        id: str
        entities: list[Entity]                  # NER results
        relations: list[Relation]               # Relation extraction
        classifications: list[Classification]
        objects: list[DetectedObject]           # Object detection
        data: dict[str, Any]                    # Structured extraction results

    class Entity(TypedDict, total=False):
        text: str                    # Extracted span
        label: str                   # Entity type
        score: float                 # Confidence (0-1)
        start: int                   # Start character offset
        end: int                     # End character offset
        bbox: list[int]              # Bounding box [x, y, w, h] (images)

    class Relation(TypedDict):
        head: str                    # Source entity
        tail: str                    # Target entity
        relation: str                # Relation type
        score: float                 # Confidence

    class Classification(TypedDict):
        label: str                   # Class label
        score: float                 # Probability

Example
    curl -X POST http://localhost:8080/v1/extract/urchade/gliner_multi-v2.1 \
      -H "Content-Type: application/json" \
      -H "Accept: application/json" \
      -d '{
        "items": [{"text": "Tim Cook is the CEO of Apple Inc."}],
        "params": {
          "labels": ["person", "organization", "role"]
        }
      }'

Response:

    {
      "model": "urchade/gliner_multi-v2.1",
      "items": [
        {
          "id": "item-0",
          "entities": [
            {"text": "Tim Cook", "label": "person", "score": 0.93, "start": 0, "end": 8},
            {"text": "CEO", "label": "role", "score": 0.88, "start": 16, "end": 19},
            {"text": "Apple Inc", "label": "organization", "score": 0.95, "start": 23, "end": 32}
          ]
        }
      ]
    }

Example with overflow_policy on gliclass:

    curl -X POST http://localhost:8080/v1/extract/knowledgator/gliclass-small-v1.0 \
      -H "Content-Type: application/json" \
      -H "Accept: application/json" \
      -d '{
        "items": [{"text": "<long review text...>"}],
        "params": {
          "labels": ["positive", "negative", "neutral"],
          "options": {"overflow_policy": "truncate_text"}
        }
      }'

When overflow_policy is "error" (or "default" on gliclass-{small,base,large}-v1.0 past the context cap), the server returns HTTP 400:

    {
      "detail": {
        "code": "INPUT_TOO_LONG",
        "message": "Item 0: observed 612 tokens (text=540, label_prompt=72) exceeds context cap 512 for knowledgator/gliclass-small-v1.0",
        "model": "knowledgator/gliclass-small-v1.0"
      }
    }

GET /v1/models
List all available models with their capabilities.
Response Schema
    class ModelsListResponse(BaseModel):
        models: list[ModelInfo]

    class ModelInfo(BaseModel):
        name: str                          # Model name
        inputs: list[str]                  # Supported inputs: text, image
        outputs: list[str]                 # Supported outputs: dense, sparse, multivector
        dims: dict[str, int]               # Dimensions per output type
        loaded: bool                       # Whether model is in GPU memory
        max_sequence_length: int           # Maximum tokens
        profiles: dict[str, ProfileInfo]   # Available profiles

    class ProfileInfo(BaseModel):
        is_default: bool                   # Whether this is the default profile
        output_types: list[str]            # Output types enabled by this profile
        output_similarity: dict[str, str]  # Similarity metrics per output type

Example

    curl -H "Accept: application/json" http://localhost:8080/v1/models

Response:

    {
      "models": [
        {
          "name": "BAAI/bge-m3",
          "inputs": ["text"],
          "outputs": ["dense", "sparse", "multivector"],
          "dims": {"dense": 1024, "sparse": 250002, "multivector": 1024},
          "loaded": true,
          "max_sequence_length": 8192,
          "profiles": {}
        },
        {
          "name": "BAAI/bge-reranker-v2-m3",
          "inputs": ["text"],
          "outputs": ["score"],
          "dims": {},
          "loaded": false,
          "max_sequence_length": 8192,
          "profiles": {}
        }
      ]
    }

POST /v1/embeddings (OpenAI Compatible)
Drop-in replacement for OpenAI’s embeddings API.
Example
    curl -X POST http://localhost:8080/v1/embeddings \
      -H "Content-Type: application/json" \
      -H "Accept: application/json" \
      -d '{
        "model": "BAAI/bge-m3",
        "input": ["Hello, world!"]
      }'

Response:

    {
      "object": "list",
      "model": "BAAI/bge-m3",
      "data": [
        {
          "object": "embedding",
          "index": 0,
          "embedding": [0.0234, -0.0891, ...]
        }
      ],
      "usage": {
        "prompt_tokens": 3,
        "total_tokens": 3
      }
    }

Works with OpenAI SDK, LangChain’s OpenAIEmbeddings, and other compatible clients.
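For callers that do not use an SDK, the response shape above can be consumed with the standard library alone. The sketch below is an illustration only: the helper name `embeddings_in_input_order` is hypothetical, and `SAMPLE_RESPONSE` is made-up data with two-dimensional vectors, not real model output. It sorts by `index` so the result lines up with the `input` list even if entries arrive out of order.

```python
import json

# Made-up response body in the documented shape (tiny vectors for readability).
SAMPLE_RESPONSE = json.dumps({
    "object": "list",
    "model": "BAAI/bge-m3",
    "data": [
        {"object": "embedding", "index": 1, "embedding": [0.5, 0.25]},
        {"object": "embedding", "index": 0, "embedding": [0.1, 0.2]},
    ],
    "usage": {"prompt_tokens": 6, "total_tokens": 6},
})


def embeddings_in_input_order(response_body: str) -> list:
    """Return one embedding per input, ordered by the `index` field."""
    entries = json.loads(response_body)["data"]
    return [e["embedding"] for e in sorted(entries, key=lambda e: e["index"])]


vectors = embeddings_in_input_order(SAMPLE_RESPONSE)
print(vectors)
```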
Health Endpoints
GET /healthz

Liveness probe. Returns 200 if the server process is running.

    curl http://localhost:8080/healthz
    # "ok"

GET /readyz

Readiness probe. On standalone sie-server, returns 200 when the Python process is ready to accept traffic. On the gateway, returns 200 when the process can accept requests; it does not wait for SIE server sidecar health or sie-config.

    curl http://localhost:8080/readyz
    # "ok"

GET /metrics

Prometheus metrics endpoint. Standalone sie-server exposes sie_* adapter metrics. Kubernetes deployments also expose sie_gateway_* on the gateway and sie_worker_* / sie_pull_loop_* from the SIE server sidecar inside each worker pod.
Available Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| sie_requests_total | Counter | model, endpoint, status | Total request count |
| sie_request_duration_seconds | Histogram | model, endpoint, phase | Latency by phase |
| sie_batch_size | Histogram | model | Batch size distribution |
| sie_tokens_processed_total | Counter | model | Total tokens processed |
| sie_queue_depth | Gauge | model | Pending items per model |
| sie_model_loaded | Gauge | model, device | Model load status (1/0) |
| sie_model_memory_bytes | Gauge | model, device | GPU memory per model |
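To show how these metrics might be read outside of Prometheus, the sketch below parses a scrape in the Prometheus text exposition format and totals `sie_requests_total` by its `status` label. The `SAMPLE` text is invented for illustration, and the label values in it (such as `endpoint="encode"` and `status="200"`) are assumptions: this reference lists the label names but not their values. The parser handles only simple sample lines, not the full exposition grammar.

```python
import re
from collections import defaultdict

# Invented scrape output; label values are assumptions, not documented ones.
SAMPLE = """\
# HELP sie_requests_total Total request count
# TYPE sie_requests_total counter
sie_requests_total{model="BAAI/bge-m3",endpoint="encode",status="200"} 42
sie_requests_total{model="BAAI/bge-m3",endpoint="encode",status="503"} 3
sie_queue_depth{model="BAAI/bge-m3"} 7
"""

LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text: str) -> list:
    """Return (metric_name, labels, value) tuples, skipping comment lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue  # not a simple sample line; ignore it
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def requests_by_status(text: str) -> dict:
    """Sum sie_requests_total across models and endpoints, keyed by status."""
    totals = defaultdict(float)
    for name, labels, value in parse_samples(text):
        if name == "sie_requests_total":
            totals[labels.get("status", "")] += value
    return dict(totals)


print(requests_by_status(SAMPLE))
```

In a real deployment a Prometheus server would scrape /metrics and queries would be written in PromQL; hand parsing like this is mainly useful for quick checks and tests.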
WebSocket /ws/status
Real-time Python sie-server status stream. Sends updates every 200ms. In Kubernetes, gateway routing health comes from SIE server sidecar NATS heartbeats; use /ws/cluster-status on the gateway for aggregate cluster status.

Message Schema

    {
      "timestamp": float,          # Unix timestamp
      "gpu": str,                  # GPU type (e.g., "l4", "a100-80gb")
      "loaded_models": list[str],  # Currently loaded models
      "server": {
        "version": str,
        "uptime_seconds": int,
        "user": str,
        "working_dir": str,
        "pid": int
      },
      "gpus": [                    # Per-GPU metrics
        {
          "index": int,
          "name": str,
          "gpu_type": str,         # Normalized type (e.g., "l4", "a100-80gb")
          "utilization_percent": float,
          "memory_used_bytes": int,
          "memory_total_bytes": int,
          "memory_threshold_pct": float,
          "temperature_c": int
        }
      ],
      "models": [                  # Per-model status
        {
          "name": str,
          "state": str,            # "loaded", "loading", "unloading", "available"
          "device": str | None,
          "memory_bytes": int,
          "queue_depth": int,
          "queue_pending_items": int,
          "config": {...}          # Model configuration
        }
      ],
      "counters": {...},           # Prometheus counter metrics
      "histograms": {...}          # Prometheus histogram metrics
    }

Example JavaScript client:

    const ws = new WebSocket("ws://localhost:8080/ws/status");
    ws.onmessage = (event) => {
      const status = JSON.parse(event.data);
      console.log(`GPU utilization: ${status.gpus[0].utilization_percent}%`);
    };

Error Responses
All endpoints return consistent error responses:

    {
      "detail": {
        "code": "MODEL_NOT_FOUND",
        "message": "Model 'unknown-model' not found"
      }
    }

Error Codes
| Code | HTTP Status | Description |
|---|---|---|
| MODEL_NOT_FOUND | 404 | Requested model doesn’t exist |
| INVALID_INPUT | 400 | Invalid request format |
| INPUT_TOO_LONG | 400 | Input exceeds model context (extract endpoint, gliclass family) |
| MODEL_NOT_LOADED | 503 | Model is not loaded or still loading |
| LORA_LOADING | 503 | LoRA adapter is loading (retry with Retry-After header) |
| QUEUE_FULL | 503 | Server overloaded, request queue is full |
| DEPENDENCY_CONFLICT | 409 | Model requires different bundle/dependencies |
| INFERENCE_ERROR | 500 | Error during model inference |
| INTERNAL_ERROR | 500 | Unexpected server error |
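One way a client might act on these codes is sketched below. Treating the three 503 codes as retryable is an assumption drawn from their descriptions, not a documented retry contract, and the names `parse_error`, `retry_delay`, and `RETRYABLE_CODES` are hypothetical. Only LORA_LOADING is documented as carrying a Retry-After header, so the helper falls back to a caller-chosen delay when the header is absent or not a plain number of seconds.

```python
import json
from typing import Optional

# Assumption: the 503 codes describe transient conditions worth retrying.
RETRYABLE_CODES = {"MODEL_NOT_LOADED", "LORA_LOADING", "QUEUE_FULL"}


def parse_error(body: str) -> tuple:
    """Extract (code, message) from the documented error envelope."""
    detail = json.loads(body)["detail"]
    return detail["code"], detail["message"]


def retry_delay(status: int, body: str, retry_after: Optional[str] = None,
                default_delay: float = 1.0) -> Optional[float]:
    """Return seconds to wait before retrying, or None if not retryable."""
    code, _message = parse_error(body)
    if status != 503 or code not in RETRYABLE_CODES:
        return None
    if retry_after is not None:
        try:
            return float(retry_after)
        except ValueError:
            pass  # e.g. an HTTP-date value; fall back to the default
    return default_delay


not_found = '{"detail": {"code": "MODEL_NOT_FOUND", "message": "Model \'unknown-model\' not found"}}'
queue_full = '{"detail": {"code": "QUEUE_FULL", "message": "queue is full"}}'
print(retry_delay(404, not_found))                   # None
print(retry_delay(503, queue_full))                  # 1.0
print(retry_delay(503, queue_full, retry_after="5")) # 5.0
```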
Response Headers
Timing and tracing information is included in response headers:

| Header | Description |
|---|---|
| X-Total-Time | Total request time (ms) |
| X-Queue-Time | Time waiting in queue (ms) |
| X-Tokenization-Time | Preprocessing time (ms) |
| X-Inference-Time | GPU inference time (ms) |
| X-Postprocessing-Time | Postprocessing time (ms), only if > 0 |
| X-Trace-ID | OpenTelemetry trace ID for distributed tracing |
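A client could collect these headers into a timing breakdown along the lines of the sketch below. It assumes each timing header carries a plain decimal number of milliseconds, which this reference implies but does not spell out. The helper name `timing_breakdown` and the sample header values are hypothetical.

```python
TIMING_HEADERS = (
    "X-Total-Time",
    "X-Queue-Time",
    "X-Tokenization-Time",
    "X-Inference-Time",
    "X-Postprocessing-Time",
)


def timing_breakdown(headers: dict) -> dict:
    """Map each timing header that is present to its value in milliseconds."""
    # HTTP header names are case-insensitive, so compare in lowercase.
    lowered = {name.lower(): value for name, value in headers.items()}
    breakdown = {}
    for name in TIMING_HEADERS:
        raw = lowered.get(name.lower())
        if raw is None:
            continue  # e.g. X-Postprocessing-Time is only sent if > 0
        try:
            breakdown[name] = float(raw)
        except ValueError:
            continue  # skip values that are not plain numbers
    return breakdown


# Hypothetical header values for illustration.
sample_headers = {
    "x-total-time": "12.5",
    "X-Queue-Time": "1.5",
    "X-Inference-Time": "8.0",
    "X-Trace-ID": "abc123",
}
timings = timing_breakdown(sample_headers)
other_ms = timings["X-Total-Time"] - sum(
    v for k, v in timings.items() if k != "X-Total-Time"
)
print(timings)
print(f"time outside the listed phases: {other_ms} ms")
```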