HTTP API Reference

This reference documents the inference HTTP surface exposed by standalone sie-server and Kubernetes sie-gateway. Component-specific status endpoints are marked below. Cluster-only config and pool endpoints are covered in Config API and Gateway.

| Endpoint | Method | Available on | Purpose |
|----------|--------|--------------|---------|
| /v1/encode/:model | POST | sie-server, sie-gateway | Generate embeddings |
| /v1/score/:model | POST | sie-server, sie-gateway | Rerank items |
| /v1/extract/:model | POST | sie-server, sie-gateway | Extract entities and structured data |
| /v1/models | GET | sie-server, sie-gateway | List available models |
| /v1/models/:model | GET | sie-server, sie-gateway | Get model details |
| /v1/embeddings | POST | sie-server, sie-gateway | OpenAI-compatible embeddings |
| /healthz | GET | All runtime components | Liveness probe |
| /readyz | GET | All runtime components | Readiness probe |
| /metrics | GET | All runtime components | Prometheus metrics |
| /ws/status | WebSocket | sie-server | Real-time Python sie-server status |
| /ws/cluster-status | WebSocket | sie-gateway | Cluster status stream |

In Kubernetes, encode, score, extract, and embeddings requests hit the Rust gateway first. The gateway publishes msgpack work items to NATS JetStream. The SIE server sidecar inside the worker pod pulls those items, batches them, calls the sie-server adapter over IPC, and publishes the results back to the gateway.

SIE defaults to msgpack for efficient binary serialization. This preserves numpy arrays natively and produces ~37% smaller payloads than JSON.

Content negotiation:

  • Content-Type: application/msgpack for requests
  • Accept: application/msgpack for responses (default)
  • Accept: application/json returns JSON

When using JSON, numpy arrays are converted to plain lists.
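The negotiation rules above can be sketched in Python as follows. This is a minimal illustration, not an official client: the helper names (`negotiation_headers`, `encode_body`, `build_request`) are hypothetical, and the msgpack branch assumes the third-party `msgpack` package is installed.

```python
import json
import urllib.request

MEDIA_TYPES = {"msgpack": "application/msgpack", "json": "application/json"}


def negotiation_headers(wire_format: str = "msgpack") -> dict:
    """Return Content-Type and Accept headers for one wire format."""
    if wire_format not in MEDIA_TYPES:
        raise ValueError(f"unsupported wire format: {wire_format!r}")
    media_type = MEDIA_TYPES[wire_format]
    return {"Content-Type": media_type, "Accept": media_type}


def encode_body(payload: dict, wire_format: str = "msgpack") -> bytes:
    """Serialize the payload so that it matches the Content-Type header."""
    if wire_format == "json":
        return json.dumps(payload).encode("utf-8")
    # Assumption: the third-party `msgpack` package is available.
    import msgpack

    return msgpack.packb(payload, use_bin_type=True)


def build_request(url: str, payload: dict, wire_format: str = "msgpack"):
    """Build (but do not send) a POST request with matching headers and body."""
    headers = negotiation_headers(wire_format)  # validates wire_format first
    body = encode_body(payload, wire_format)
    return urllib.request.Request(url, data=body, headers=headers, method="POST")
```

Keeping the header choice and the body serialization behind one `wire_format` argument avoids the common mistake of sending a JSON body under a msgpack Content-Type.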


POST /v1/encode/:model

Generate embeddings for input items. Supports dense, sparse, and multi-vector outputs.

from __future__ import annotations  # allows forward references between the schemas

from typing import Any, TypedDict

class EncodeRequest(TypedDict, total=False):
    items: list[Item]  # Required: items to encode
    params: EncodeParams  # Optional: encoding parameters

class EncodeParams(TypedDict, total=False):
    output_types: list[str]  # 'dense', 'sparse', 'multivector'
    instruction: str  # Task instruction for query encoding
    output_dtype: str  # 'float32', 'float16', 'int8', 'binary'
    options: dict[str, Any]  # Profile, LoRA, runtime options

class Item(TypedDict, total=False):
    id: str  # Client-provided ID (echoed back)
    text: str  # Text content
    images: list[ImageInput]  # Image bytes with format hint

class ImageInput(TypedDict, total=False):
    data: bytes  # Image bytes
    format: str  # 'jpeg', 'png', 'webp'

class EncodeResponse(TypedDict, total=False):
    model: str  # Model name used
    items: list[EncodeResult]  # One result per input item
    timing: TimingInfo  # Server-side timing breakdown

class EncodeResult(TypedDict, total=False):
    id: str  # Echoed item ID
    dense: DenseVector  # Dense embedding
    sparse: SparseVector  # Sparse embedding
    multivector: MultiVector  # Per-token embeddings

class DenseVector(TypedDict, total=False):
    dims: int  # Vector dimensionality
    dtype: str  # 'float32', 'float16', 'int8', 'binary'
    values: list[float]  # Vector values

class SparseVector(TypedDict, total=False):
    dims: int  # Vocabulary size
    dtype: str  # Data type
    indices: list[int]  # Non-zero dimension indices
    values: list[float]  # Values at those indices

class MultiVector(TypedDict, total=False):
    token_dims: int  # Per-token embedding dimension
    num_tokens: int  # Number of tokens
    dtype: str  # Data type
    values: list[list[float]]  # Token embeddings

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| items | list[Item] | Required | Items to encode |
| params.output_types | list[str] | ["dense"] | Output types to return |
| params.instruction | str | None | Instruction prefix for query encoding |
| params.output_dtype | str | "float32" | Output precision |
| params.options | dict | None | Runtime options (profile, lora, etc.) |

Basic encoding:

curl -X POST http://localhost:8080/v1/encode/BAAI/bge-m3 \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{
    "items": [{"text": "Hello, world!"}]
  }'

Multiple output types:

curl -X POST http://localhost:8080/v1/encode/BAAI/bge-m3 \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{
    "items": [{"text": "Search query"}],
    "params": {
      "output_types": ["dense", "sparse"],
      "instruction": "Represent this query for retrieval:"
    }
  }'

Response:

{
  "model": "BAAI/bge-m3",
  "items": [
    {
      "dense": {
        "dims": 1024,
        "dtype": "float32",
        "values": [0.0234, -0.0891, 0.1234, ...]
      },
      "sparse": {
        "dims": 250002,
        "dtype": "float32",
        "indices": [101, 2023, 5789, ...],
        "values": [0.45, 0.32, 0.28, ...]
      }
    }
  ]
}
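A decoded encode response can be unpacked with a few lines of Python. The sketch below assumes the response has already been parsed into a dict with the shape shown above; the helper names `dense_values` and `sparse_weights` are hypothetical and not part of any SIE client.

```python
def dense_values(response: dict) -> list:
    """Return the dense vector of every result, in input order."""
    vectors = []
    for position, result in enumerate(response["items"]):
        dense = result.get("dense")
        if dense is None:
            # "dense" was not requested in params.output_types
            raise KeyError(f"item {position} has no dense output")
        vectors.append(dense["values"])
    return vectors


def sparse_weights(result: dict) -> dict:
    """Map each non-zero dimension index to its weight for one result."""
    sparse = result.get("sparse")
    if sparse is None:
        return {}
    indices, values = sparse["indices"], sparse["values"]
    if len(indices) != len(values):
        raise ValueError("sparse indices and values differ in length")
    return dict(zip(indices, values))
```

Pairing `indices` with `values` up front gives a structure that is easy to feed into an inverted index or a sparse dot product.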

POST /v1/score/:model

Rerank items against a query using a cross-encoder model.

class ScoreRequest(TypedDict, total=False):
    query: Item  # Required: query to score against
    items: list[Item]  # Required: items to score
    instruction: str  # Optional instruction
    options: dict[str, Any]  # Runtime options

class ScoreResponse(TypedDict, total=False):
    model: str
    query_id: str | None  # Echoed query ID
    scores: list[ScoreEntry]  # Sorted by score descending

class ScoreEntry(TypedDict):
    item_id: str | None  # Echoed item ID
    score: float  # Relevance score
    rank: int  # Position (0 = most relevant)

curl -X POST http://localhost:8080/v1/score/BAAI/bge-reranker-v2-m3 \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{
    "query": {"text": "What is machine learning?"},
    "items": [
      {"id": "doc-1", "text": "ML uses algorithms to learn from data."},
      {"id": "doc-2", "text": "The weather is sunny today."}
    ]
  }'

Response:

{
  "model": "BAAI/bge-reranker-v2-m3",
  "scores": [
    {"item_id": "doc-1", "score": 0.891, "rank": 0},
    {"item_id": "doc-2", "score": 0.023, "rank": 1}
  ]
}
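Because each score entry echoes the client-provided `item_id`, results can be joined back to the original documents. A minimal sketch, assuming every item was sent with an `id`; `rank_documents` is a hypothetical helper name.

```python
def rank_documents(response: dict, documents: dict) -> list:
    """Join score entries with their source text, ordered by rank.

    `documents` maps the item IDs sent in the request to their text.
    """
    entries = sorted(response["scores"], key=lambda entry: entry["rank"])
    return [
        (entry["rank"], entry["score"], documents[entry["item_id"]])
        for entry in entries
    ]
```

Sorting by `rank` instead of relying on list order keeps the helper correct even if a proxy or client library reorders the entries.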

POST /v1/extract/:model

Extract structured data from items: entities, relations, classifications, or vision outputs.

class ExtractRequest(TypedDict, total=False):
    items: list[Item]  # Required: items to extract from
    params: ExtractParams  # Optional: extraction parameters

class ExtractParams(TypedDict, total=False):
    labels: list[str]  # Entity types for NER
    output_schema: dict  # JSON schema for structured extraction
    instruction: str  # Task instruction
    options: dict[str, Any]  # Runtime options (see below)

params.options is an adapter-specific dict. Currently supported keys:

| Key | Type | Default | Scope | Description |
|-----|------|---------|-------|-------------|
| overflow_policy | "default", "truncate_text", or "error" | "default" | gliclass-* family | Controls behavior when text + label_prompt exceeds the model’s context (512 tokens for gliclass-{small,base,large}-v1.0). "default" passes input through as-is (may surface as INPUT_TOO_LONG on these models). "truncate_text" truncates the end of text to fit while preserving labels. "error" always raises INPUT_TOO_LONG on overflow. |

class ExtractResponse(TypedDict, total=False):
    model: str
    items: list[ExtractResult]

class ExtractResult(TypedDict, total=False):
    id: str
    entities: list[Entity]  # NER results
    relations: list[Relation]  # Relation extraction
    classifications: list[Classification]
    objects: list[DetectedObject]  # Object detection
    data: dict[str, Any]  # Structured extraction results

class Entity(TypedDict, total=False):
    text: str  # Extracted span
    label: str  # Entity type
    score: float  # Confidence (0-1)
    start: int  # Start character offset
    end: int  # End character offset
    bbox: list[int]  # Bounding box [x, y, w, h] (images)

class Relation(TypedDict):
    head: str  # Source entity
    tail: str  # Target entity
    relation: str  # Relation type
    score: float  # Confidence

class Classification(TypedDict):
    label: str  # Class label
    score: float  # Probability

curl -X POST http://localhost:8080/v1/extract/urchade/gliner_multi-v2.1 \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{
    "items": [{"text": "Tim Cook is the CEO of Apple Inc."}],
    "params": {
      "labels": ["person", "organization", "role"]
    }
  }'

Response:

{
  "model": "urchade/gliner_multi-v2.1",
  "items": [
    {
      "id": "item-0",
      "entities": [
        {"text": "Tim Cook", "label": "person", "score": 0.93, "start": 0, "end": 8},
        {"text": "CEO", "label": "role", "score": 0.88, "start": 16, "end": 19},
        {"text": "Apple Inc", "label": "organization", "score": 0.95, "start": 23, "end": 32}
      ]
    }
  ]
}
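In the sample response, `start` and `end` behave like Python slice offsets with an exclusive end, so `text[start:end]` reproduces each entity's text. The sketch below relies on that reading, which is inferred from the example and not a documented guarantee; both helper names are hypothetical.

```python
def mismatched_spans(text: str, entities: list) -> list:
    """Return entities whose [start, end) slice does not reproduce their text."""
    return [
        entity
        for entity in entities
        if text[entity["start"]:entity["end"]] != entity["text"]
    ]


def group_by_label(entities: list) -> dict:
    """Collect extracted spans under their entity label."""
    grouped = {}
    for entity in entities:
        grouped.setdefault(entity["label"], []).append(entity["text"])
    return grouped
```

Checking the spans before highlighting or redacting text catches offset drift early, for example when the client normalized the text differently from what it sent.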

Example with overflow_policy on gliclass:

curl -X POST http://localhost:8080/v1/extract/knowledgator/gliclass-small-v1.0 \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{
    "items": [{"text": "<long review text...>"}],
    "params": {
      "labels": ["positive", "negative", "neutral"],
      "options": {"overflow_policy": "truncate_text"}
    }
  }'

When overflow_policy is "error" (or "default" on gliclass-{small,base,large}-v1.0 past the context cap), the server returns HTTP 400:

{
  "detail": {
    "code": "INPUT_TOO_LONG",
    "message": "Item 0: observed 612 tokens (text=540, label_prompt=72) exceeds context cap 512 for knowledgator/gliclass-small-v1.0",
    "model": "knowledgator/gliclass-small-v1.0"
  }
}
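One way a client might react to this error is to resend the request with `overflow_policy` set to `"truncate_text"`. The following is a sketch under that assumption; `is_input_too_long` and `with_truncation` are hypothetical helpers, and the status code and parsed body are expected to come from whatever HTTP client is in use.

```python
import copy


def is_input_too_long(status: int, body: dict) -> bool:
    """Detect the INPUT_TOO_LONG error shape shown above."""
    detail = body.get("detail")
    return (
        status == 400
        and isinstance(detail, dict)
        and detail.get("code") == "INPUT_TOO_LONG"
    )


def with_truncation(payload: dict) -> dict:
    """Return a copy of an extract payload that opts in to truncate_text."""
    retry = copy.deepcopy(payload)  # leave the caller's payload untouched
    options = retry.setdefault("params", {}).setdefault("options", {})
    options["overflow_policy"] = "truncate_text"
    return retry
```

Truncation drops the end of the text, so this fallback suits classification of long reviews better than tasks where the tail of the input matters.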

GET /v1/models

List all available models with their capabilities.

class ModelsListResponse(BaseModel):
    models: list[ModelInfo]

class ModelInfo(BaseModel):
    name: str  # Model name
    inputs: list[str]  # Supported inputs: text, image
    outputs: list[str]  # Supported outputs: dense, sparse, multivector
    dims: dict[str, int]  # Dimensions per output type
    loaded: bool  # Whether model is in GPU memory
    max_sequence_length: int  # Maximum tokens
    profiles: dict[str, ProfileInfo]  # Available profiles

class ProfileInfo(BaseModel):
    is_default: bool  # Whether this is the default profile
    output_types: list[str]  # Output types enabled by this profile
    output_similarity: dict[str, str]  # Similarity metrics per output type

curl -H "Accept: application/json" http://localhost:8080/v1/models

Response:

{
  "models": [
    {
      "name": "BAAI/bge-m3",
      "inputs": ["text"],
      "outputs": ["dense", "sparse", "multivector"],
      "dims": {"dense": 1024, "sparse": 250002, "multivector": 1024},
      "loaded": true,
      "max_sequence_length": 8192,
      "profiles": {}
    },
    {
      "name": "BAAI/bge-reranker-v2-m3",
      "inputs": ["text"],
      "outputs": ["score"],
      "dims": {},
      "loaded": false,
      "max_sequence_length": 8192,
      "profiles": {}
    }
  ]
}
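The listing is convenient for choosing a model by capability before sending work. A minimal sketch over the parsed JSON response; `models_with_output` is a hypothetical helper name.

```python
def models_with_output(listing: dict, output: str, loaded_only: bool = False) -> list:
    """Return names of models that support `output`, optionally only loaded ones."""
    return [
        model["name"]
        for model in listing["models"]
        if output in model["outputs"] and (model["loaded"] or not loaded_only)
    ]
```

For the sample response above, asking for `"dense"` yields `["BAAI/bge-m3"]`, while asking for `"score"` with `loaded_only=True` yields an empty list because the reranker is not loaded.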

POST /v1/embeddings

Drop-in replacement for OpenAI’s embeddings API.

curl -X POST http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{
    "model": "BAAI/bge-m3",
    "input": ["Hello, world!"]
  }'

Response:

{
  "object": "list",
  "model": "BAAI/bge-m3",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [0.0234, -0.0891, ...]
    }
  ],
  "usage": {
    "prompt_tokens": 3,
    "total_tokens": 3
  }
}

Works with the OpenAI SDK, LangChain’s OpenAIEmbeddings, and other compatible clients.
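For clients that cannot pull in an SDK, the same request can be built with the Python standard library. This is a sketch only: `embeddings_request` and `embeddings_in_order` are hypothetical helpers, and the request is constructed but not sent.

```python
import json
import urllib.request


def embeddings_request(base_url: str, model: str, inputs: list):
    """Build (but do not send) an OpenAI-style embeddings request."""
    body = json.dumps({"model": model, "input": inputs}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/embeddings",
        data=body,
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


def embeddings_in_order(response: dict) -> list:
    """Return embeddings sorted by their `index` field."""
    data = sorted(response["data"], key=lambda entry: entry["index"])
    return [entry["embedding"] for entry in data]
```

Passing the result of `embeddings_request` to `urllib.request.urlopen` and decoding the JSON body gives the dict that `embeddings_in_order` expects.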


GET /healthz

Liveness probe. Returns 200 if the server process is running.

curl http://localhost:8080/healthz
# "ok"

GET /readyz

Readiness probe. On standalone sie-server, returns 200 when the Python process is ready to accept traffic. On the gateway, returns 200 when the process can accept requests; it does not wait for SIE server sidecar health or sie-config.

curl http://localhost:8080/readyz
# "ok"
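A startup script might poll the readiness probe before sending traffic. The sketch below is one way to do that with the standard library; `http_probe` and `wait_until_ready` are hypothetical helpers, and the attempt count and interval are arbitrary defaults.

```python
import time
import urllib.error
import urllib.request


def http_probe(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status
        return False


def wait_until_ready(probe, attempts: int = 30, interval: float = 1.0, sleep=time.sleep) -> bool:
    """Call `probe` until it returns True or the attempts run out."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(interval)  # no sleep after the final failed attempt
    return False
```

A typical call would be `wait_until_ready(lambda: http_probe("http://localhost:8080/readyz"))`. Passing the probe as a callable keeps the loop testable without a running server.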

GET /metrics

Prometheus metrics endpoint. Standalone sie-server exposes sie_* adapter metrics. Kubernetes deployments also expose sie_gateway_* on the gateway and sie_worker_* / sie_pull_loop_* from the SIE server sidecar inside each worker pod.

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| sie_requests_total | Counter | model, endpoint, status | Total request count |
| sie_request_duration_seconds | Histogram | model, endpoint, phase | Latency by phase |
| sie_batch_size | Histogram | model | Batch size distribution |
| sie_tokens_processed_total | Counter | model | Total tokens processed |
| sie_queue_depth | Gauge | model | Pending items per model |
| sie_model_loaded | Gauge | model, device | Model load status (1/0) |
| sie_model_memory_bytes | Gauge | model, device | GPU memory per model |
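Outside of a Prometheus server, the text exposition format can be read directly, for example to check `sie_queue_depth` in a script. The parser below is a simplified sketch: `parse_samples` is a hypothetical helper, it does not unescape label values, and production code would more likely use a Prometheus client library.

```python
import re

# name, optional {labels}, then the sample value
SAMPLE_LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text: str, metric: str) -> list:
    """Return (labels, value) pairs for every sample of one metric."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if match is None or match.group("name") != metric:
            continue
        labels = dict(LABEL.findall(match.group("labels") or ""))
        samples.append((labels, float(match.group("value"))))
    return samples
```

Feeding the body of `GET /metrics` into `parse_samples(body, "sie_queue_depth")` returns one entry per `model` label.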

WebSocket /ws/status

Real-time Python sie-server status stream. Sends updates every 200 ms. In Kubernetes, gateway routing health comes from SIE server sidecar NATS heartbeats; use /ws/cluster-status on the gateway for aggregate cluster status.

{
  "timestamp": float,            # Unix timestamp
  "gpu": str,                    # GPU type (e.g., "l4", "a100-80gb")
  "loaded_models": list[str],    # Currently loaded models
  "server": {
    "version": str,
    "uptime_seconds": int,
    "user": str,
    "working_dir": str,
    "pid": int
  },
  "gpus": [                      # Per-GPU metrics
    {
      "index": int,
      "name": str,
      "gpu_type": str,           # Normalized type (e.g., "l4", "a100-80gb")
      "utilization_percent": float,
      "memory_used_bytes": int,
      "memory_total_bytes": int,
      "memory_threshold_pct": float,
      "temperature_c": int
    }
  ],
  "models": [                    # Per-model status
    {
      "name": str,
      "state": str,              # "loaded", "loading", "unloading", "available"
      "device": str | None,
      "memory_bytes": int,
      "queue_depth": int,
      "queue_pending_items": int,
      "config": {...}            # Model configuration
    }
  ],
  "counters": {...},             # Prometheus counter metrics
  "histograms": {...}            # Prometheus histogram metrics
}

const ws = new WebSocket("ws://localhost:8080/ws/status");
ws.onmessage = (event) => {
  const status = JSON.parse(event.data);
  console.log(`GPU utilization: ${status.gpus[0].utilization_percent}%`);
};
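Once a status message has been decoded from JSON, a dashboard or alerting script can reduce it to a few numbers. The sketch below works on an already-parsed message dict and does not open a WebSocket itself; both helper names are hypothetical.

```python
def gpu_memory_fraction(gpu: dict) -> float:
    """Fraction of GPU memory in use, from one entry of the "gpus" list."""
    total = gpu["memory_total_bytes"]
    return gpu["memory_used_bytes"] / total if total else 0.0


def models_by_state(status: dict) -> dict:
    """Group model names by their reported state."""
    grouped = {}
    for model in status["models"]:
        grouped.setdefault(model["state"], []).append(model["name"])
    return grouped
```

Since updates arrive roughly five times per second, a consumer would typically keep only the latest summary instead of storing every message.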

All endpoints return consistent error responses:

{
  "detail": {
    "code": "MODEL_NOT_FOUND",
    "message": "Model 'unknown-model' not found"
  }
}

| Code | HTTP Status | Description |
|------|-------------|-------------|
| MODEL_NOT_FOUND | 404 | Requested model doesn’t exist |
| INVALID_INPUT | 400 | Invalid request format |
| INPUT_TOO_LONG | 400 | Input exceeds model context (extract endpoint, gliclass family) |
| MODEL_NOT_LOADED | 503 | Model is not loaded or still loading |
| LORA_LOADING | 503 | LoRA adapter is loading (retry after the delay given in the Retry-After header) |
| QUEUE_FULL | 503 | Server overloaded, request queue is full |
| DEPENDENCY_CONFLICT | 409 | Model requires different bundle/dependencies |
| INFERENCE_ERROR | 500 | Error during model inference |
| INTERNAL_ERROR | 500 | Unexpected server error |
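The error codes suggest a simple client-side retry policy: the 503 codes are transient, the rest are not. The sketch below encodes that reading. Treating all three 503 codes as retryable is an assumption, as are the backoff parameters; `retry_delay` is a hypothetical helper.

```python
from typing import Optional

# Assumption: every 503 code in the table is worth retrying.
RETRYABLE_CODES = {"MODEL_NOT_LOADED", "LORA_LOADING", "QUEUE_FULL"}


def retry_delay(status: int, body: dict, headers: dict, attempt: int,
                base: float = 0.5, cap: float = 30.0) -> Optional[float]:
    """Return seconds to wait before retrying, or None if not retryable."""
    detail = body.get("detail") if isinstance(body, dict) else None
    code = detail.get("code") if isinstance(detail, dict) else None
    if status != 503 or code not in RETRYABLE_CODES:
        return None
    lowered = {name.lower(): value for name, value in headers.items()}
    retry_after = lowered.get("retry-after")
    if retry_after is not None:
        try:
            return min(float(retry_after), cap)
        except ValueError:
            pass  # HTTP-date form is not handled; fall back to backoff
    return min(base * 2 ** attempt, cap)  # exponential backoff, capped
```

Honoring Retry-After first lets the server control the pacing, with exponential backoff as the fallback when the header is absent.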

Timing and tracing information is included in response headers:

| Header | Description |
|--------|-------------|
| X-Total-Time | Total request time (ms) |
| X-Queue-Time | Time waiting in queue (ms) |
| X-Tokenization-Time | Preprocessing time (ms) |
| X-Inference-Time | GPU inference time (ms) |
| X-Postprocessing-Time | Postprocessing time (ms), only if > 0 |
| X-Trace-ID | OpenTelemetry trace ID for distributed tracing |
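These headers can be turned into a per-phase breakdown on the client. The sketch assumes each timing header carries a plain number in milliseconds; `timing_breakdown` and the derived `other` bucket are illustrative names, not part of the API.

```python
TIMING_HEADERS = {
    "X-Total-Time": "total",
    "X-Queue-Time": "queue",
    "X-Tokenization-Time": "tokenization",
    "X-Inference-Time": "inference",
    "X-Postprocessing-Time": "postprocessing",
}


def timing_breakdown(headers: dict) -> dict:
    """Map timing headers to phase names (ms) and derive the unattributed rest."""
    lowered = {name.lower(): value for name, value in headers.items()}
    timings = {}
    for header, phase in TIMING_HEADERS.items():
        raw = lowered.get(header.lower())
        if raw is not None:  # X-Postprocessing-Time may be absent
            timings[phase] = float(raw)
    if "total" in timings:
        accounted = sum(ms for phase, ms in timings.items() if phase != "total")
        timings["other"] = timings["total"] - accounted
    return timings
```

Logging the breakdown next to X-Trace-ID makes it easier to tell queueing delay from inference time when a request is slow.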
