HTTP API Reference

This reference documents the inference HTTP surface exposed by standalone sie-server and Kubernetes sie-gateway. Component-specific status endpoints are marked below. Cluster-only config and pool endpoints are covered in Config API and Gateway.

Endpoint Summary

Endpoint	Method	Available on	Purpose
`/v1/encode/:model`	POST	`sie-server`, `sie-gateway`	Generate embeddings
`/v1/score/:model`	POST	`sie-server`, `sie-gateway`	Rerank items
`/v1/extract/:model`	POST	`sie-server`, `sie-gateway`	Extract entities and structured data
`/v1/models`	GET	`sie-server`, `sie-gateway`	List available models
`/v1/models/:model`	GET	`sie-server`, `sie-gateway`	Get model details
`/v1/embeddings`	POST	`sie-server`, `sie-gateway`	OpenAI-compatible embeddings
`/healthz`	GET	All runtime components	Liveness probe
`/readyz`	GET	All runtime components	Readiness probe
`/metrics`	GET	All runtime components	Prometheus metrics
`/ws/status`	WebSocket	`sie-server`	Real-time Python `sie-server` status
`/ws/cluster-status`	WebSocket	`sie-gateway`	Cluster status stream

In Kubernetes, encode, score, extract, and embeddings requests hit the Rust gateway first. The gateway publishes msgpack work items to NATS JetStream. The SIE server sidecar inside the worker pod then pulls and batches those items, calls the sie-server adapter over IPC, and publishes the result back to the gateway.

Wire Format

SIE defaults to msgpack for efficient binary serialization. Msgpack preserves NumPy arrays natively and produces payloads roughly 37% smaller than JSON.

Content negotiation:

Content-Type: application/msgpack for requests
Accept: application/msgpack for responses (default)
Accept: application/json returns JSON

In JSON responses, arrays are converted to plain lists.
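
The negotiation rules above can be wrapped in a small helper. The following is a minimal sketch, not part of any SIE client: `build_request` and its `packer` argument are hypothetical names, and the msgpack branch assumes the caller passes in a serializer such as `msgpack.packb` from the third-party `msgpack` package. The request is built but never sent.

```python
import json
import urllib.request
from typing import Any, Callable, Optional

MSGPACK = "application/msgpack"
JSON = "application/json"


def build_request(
    url: str,
    payload: dict[str, Any],
    wire: str = "msgpack",
    packer: Optional[Callable[[Any], bytes]] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST request with matching negotiation headers."""
    if wire == "json":
        body = json.dumps(payload).encode("utf-8")
        media_type = JSON
    elif wire == "msgpack":
        if packer is None:
            # Assumption: the caller supplies a msgpack serializer, e.g. msgpack.packb.
            raise ValueError("msgpack wire format needs a packer callable")
        body = packer(payload)
        media_type = MSGPACK
    else:
        raise ValueError(f"unknown wire format: {wire!r}")
    # Ask for the response in the same format the request body uses.
    headers = {"Content-Type": media_type, "Accept": media_type}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")
```

Passing the result to `urllib.request.urlopen` would send it; keeping construction separate makes the headers easy to inspect first.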

POST /v1/encode/:model

Generate embeddings for input items. Supports dense, sparse, and multi-vector outputs.

Request Schema

class EncodeRequest(TypedDict, total=False):
    items: list[Item]              # Required: items to encode
    params: EncodeParams           # Optional: encoding parameters

class EncodeParams(TypedDict, total=False):
    output_types: list[str]        # 'dense', 'sparse', 'multivector'
    instruction: str               # Task instruction for query encoding
    output_dtype: str              # 'float32', 'float16', 'int8', 'binary'
    options: dict[str, Any]        # Profile, LoRA, runtime options

class Item(TypedDict, total=False):
    id: str                        # Client-provided ID (echoed back)
    text: str                      # Text content
    images: list[ImageInput]       # Image bytes with format hint

class ImageInput(TypedDict, total=False):
    data: bytes                    # Image bytes
    format: str                    # 'jpeg', 'png', 'webp'

Response Schema

class EncodeResponse(TypedDict, total=False):
    model: str                     # Model name used
    items: list[EncodeResult]      # One result per input item
    timing: TimingInfo             # Server-side timing breakdown

class EncodeResult(TypedDict, total=False):
    id: str                        # Echoed item ID
    dense: DenseVector             # Dense embedding
    sparse: SparseVector           # Sparse embedding
    multivector: MultiVector       # Per-token embeddings

class DenseVector(TypedDict, total=False):
    dims: int                      # Vector dimensionality
    dtype: str                     # 'float32', 'float16', 'int8', 'binary'
    values: list[float]            # Vector values

class SparseVector(TypedDict, total=False):
    dims: int                      # Vocabulary size
    dtype: str                     # Data type
    indices: list[int]             # Non-zero dimension indices
    values: list[float]            # Values at those indices

class MultiVector(TypedDict, total=False):
    token_dims: int                # Per-token embedding dimension
    num_tokens: int                # Number of tokens
    dtype: str                     # Data type
    values: list[list[float]]      # Token embeddings

Request Parameters

Parameter	Type	Default	Description
`items`	`list[Item]`	Required	Items to encode
`params.output_types`	`list[str]`	`["dense"]`	Output types to return
`params.instruction`	`str`	None	Instruction prefix for query encoding
`params.output_dtype`	`str`	`"float32"`	Output precision
`params.options`	`dict`	None	Runtime options (profile, lora, etc.)
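
As a rough illustration of the parameters in the table, the sketch below assembles an encode request body and leaves out any parameter the caller did not set, so the server defaults apply. `encode_payload` is a hypothetical helper, and the client-side validation is an assumption of this sketch rather than documented server behavior.

```python
from typing import Any, Optional

OUTPUT_TYPES = {"dense", "sparse", "multivector"}
OUTPUT_DTYPES = {"float32", "float16", "int8", "binary"}


def encode_payload(
    texts: list[str],
    output_types: Optional[list[str]] = None,
    instruction: Optional[str] = None,
    output_dtype: Optional[str] = None,
) -> dict[str, Any]:
    """Build an EncodeRequest body, omitting params left unset."""
    params: dict[str, Any] = {}
    if output_types is not None:
        unknown = set(output_types) - OUTPUT_TYPES
        if unknown:
            raise ValueError(f"unsupported output types: {sorted(unknown)}")
        params["output_types"] = output_types
    if instruction is not None:
        params["instruction"] = instruction
    if output_dtype is not None:
        if output_dtype not in OUTPUT_DTYPES:
            raise ValueError(f"unsupported output dtype: {output_dtype!r}")
        params["output_dtype"] = output_dtype
    payload: dict[str, Any] = {"items": [{"text": text} for text in texts]}
    if params:
        # Only send "params" when at least one value differs from the defaults.
        payload["params"] = params
    return payload
```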

Examples

Basic encoding:

curl -X POST http://localhost:8080/v1/encode/BAAI/bge-m3 \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{
    "items": [{"text": "Hello, world!"}]
  }'

Multiple output types:

curl -X POST http://localhost:8080/v1/encode/BAAI/bge-m3 \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{
    "items": [{"text": "Search query"}],
    "params": {
      "output_types": ["dense", "sparse"],
      "instruction": "Represent this query for retrieval:"
    }
  }'

Response:

{
  "model": "BAAI/bge-m3",
  "items": [
    {
      "dense": {
        "dims": 1024,
        "dtype": "float32",
        "values": [0.0234, -0.0891, 0.1234, ...]
      },
      "sparse": {
        "dims": 250002,
        "dtype": "float32",
        "indices": [101, 2023, 5789, ...],
        "values": [0.45, 0.32, 0.28, ...]
      }
    }
  ]
}
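
On the client side, the parallel `indices` and `values` arrays of a sparse vector are often easier to work with as a mapping. The sketch below is one possible way to do that with a decoded JSON response; `sparse_to_dict` and `sparse_dot` are hypothetical helper names.

```python
from typing import Any


def sparse_to_dict(sparse: dict[str, Any]) -> dict[int, float]:
    """Pair each non-zero index with its value."""
    indices, values = sparse["indices"], sparse["values"]
    if len(indices) != len(values):
        raise ValueError("indices and values must have the same length")
    return dict(zip(indices, values))


def sparse_dot(a: dict[int, float], b: dict[int, float]) -> float:
    """Dot product of two sparse vectors stored as index -> value maps."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    return sum(value * b[index] for index, value in a.items() if index in b)
```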

POST /v1/score/:model

Rerank items against a query using a cross-encoder model.

Request Schema

class ScoreRequest(TypedDict, total=False):
    query: Item                    # Required: query to score against
    items: list[Item]              # Required: items to score
    instruction: str               # Optional instruction
    options: dict[str, Any]        # Runtime options

Response Schema

class ScoreResponse(TypedDict, total=False):
    model: str
    query_id: str | None           # Echoed query ID
    scores: list[ScoreEntry]       # Sorted by score descending

class ScoreEntry(TypedDict):
    item_id: str | None            # Echoed item ID
    score: float                   # Relevance score
    rank: int                      # Position (0 = most relevant)

Example

curl -X POST http://localhost:8080/v1/score/BAAI/bge-reranker-v2-m3 \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{
    "query": {"text": "What is machine learning?"},
    "items": [
      {"id": "doc-1", "text": "ML uses algorithms to learn from data."},
      {"id": "doc-2", "text": "The weather is sunny today."}
    ]
  }'

Response:

{
  "model": "BAAI/bge-reranker-v2-m3",
  "scores": [
    {"item_id": "doc-1", "score": 0.891, "rank": 0},
    {"item_id": "doc-2", "score": 0.023, "rank": 1}
  ]
}
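
Because each score entry echoes the `item_id` sent by the client, results can be joined back to the original documents. A minimal sketch, assuming every item was sent with an `id`; `ranked_documents` is a hypothetical helper.

```python
from typing import Any


def ranked_documents(
    response: dict[str, Any], documents: dict[str, str]
) -> list[tuple[int, float, str]]:
    """Return (rank, score, text) tuples ordered from most to least relevant."""
    # The response is documented as sorted already; sorting by rank keeps
    # this helper correct even if the entries were reordered in transit.
    entries = sorted(response["scores"], key=lambda entry: entry["rank"])
    return [
        (entry["rank"], entry["score"], documents[entry["item_id"]])
        for entry in entries
    ]
```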

POST /v1/extract/:model

Extract structured data from items: entities, relations, classifications, or vision outputs.

Request Schema

class ExtractRequest(TypedDict, total=False):
    items: list[Item]              # Required: items to extract from
    params: ExtractParams          # Optional: extraction parameters

class ExtractParams(TypedDict, total=False):
    labels: list[str]              # Entity types for NER
    output_schema: dict            # JSON schema for structured extraction
    instruction: str               # Task instruction
    options: dict[str, Any]        # Runtime options (see below)

Per-request options

params.options is an adapter-specific dict. Currently supported keys:

Key	Type	Default	Scope	Description
`overflow_policy`	`"default"` | `"truncate_text"` | `"error"`	`"default"`	`gliclass-*` family	Controls behavior when `text + label_prompt` exceeds the model’s context (512 tokens for `gliclass-{small,base,large}-v1.0`). `default` passes input through as-is (may surface as `INPUT_TOO_LONG` on these models). `truncate_text` truncates the end of `text` to fit while preserving labels. `error` always raises `INPUT_TOO_LONG` on overflow.

Response Schema

class ExtractResponse(TypedDict, total=False):
    model: str
    items: list[ExtractResult]

class ExtractResult(TypedDict, total=False):
    id: str
    entities: list[Entity]         # NER results
    relations: list[Relation]      # Relation extraction
    classifications: list[Classification]
    objects: list[DetectedObject]  # Object detection
    data: dict[str, Any]           # Structured extraction results

class Entity(TypedDict, total=False):
    text: str                      # Extracted span
    label: str                     # Entity type
    score: float                   # Confidence (0-1)
    start: int                     # Start character offset
    end: int                       # End character offset
    bbox: list[int]                # Bounding box [x, y, w, h] (images)

class Relation(TypedDict):
    head: str                      # Source entity
    tail: str                      # Target entity
    relation: str                  # Relation type
    score: float                   # Confidence

class Classification(TypedDict):
    label: str                     # Class label
    score: float                   # Probability

Example

curl -X POST http://localhost:8080/v1/extract/urchade/gliner_multi-v2.1 \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{
    "items": [{"text": "Tim Cook is the CEO of Apple Inc."}],
    "params": {
      "labels": ["person", "organization", "role"]
    }
  }'

Response:

{
  "model": "urchade/gliner_multi-v2.1",
  "items": [
    {
      "id": "item-0",
      "entities": [
        {"text": "Tim Cook", "label": "person", "score": 0.93, "start": 0, "end": 8},
        {"text": "CEO", "label": "role", "score": 0.88, "start": 16, "end": 19},
        {"text": "Apple Inc", "label": "organization", "score": 0.95, "start": 23, "end": 32}
      ]
    }
  ]
}
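
In the response above, `start` and `end` behave like Python slice offsets into the input text (end exclusive). The sketch below relies on that reading, which is an assumption drawn from this one example, to group entities by label and check each span against the source text. `entities_by_label` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Any


def entities_by_label(text: str, result: dict[str, Any]) -> dict[str, list[str]]:
    """Group extracted spans by label, verifying offsets against the input."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for entity in result.get("entities", []):
        span = text[entity["start"]:entity["end"]]
        if span != entity["text"]:
            raise ValueError(f"offset mismatch: {span!r} != {entity['text']!r}")
        grouped[entity["label"]].append(span)
    return dict(grouped)
```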

Example with overflow_policy on gliclass:

curl -X POST http://localhost:8080/v1/extract/knowledgator/gliclass-small-v1.0 \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{
    "items": [{"text": "<long review text...>"}],
    "params": {
      "labels": ["positive", "negative", "neutral"],
      "options": {"overflow_policy": "truncate_text"}
    }
  }'

When overflow_policy is "error" (or "default" on gliclass-{small,base,large}-v1.0 past the context cap), the server returns HTTP 400:

{
  "detail": {
    "code": "INPUT_TOO_LONG",
    "message": "Item 0: observed 612 tokens (text=540, label_prompt=72) exceeds context cap 512 for knowledgator/gliclass-small-v1.0",
    "model": "knowledgator/gliclass-small-v1.0"
  }
}
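
A client can react to this error by resubmitting with `truncate_text`. The sketch below shows one way to detect the error and derive the retry payload; both function names are hypothetical, and whether silent truncation is acceptable depends on the application.

```python
import copy
from typing import Any


def is_input_too_long(status: int, body: dict[str, Any]) -> bool:
    """True when the response is the INPUT_TOO_LONG error shown above."""
    detail = body.get("detail")
    return (
        status == 400
        and isinstance(detail, dict)
        and detail.get("code") == "INPUT_TOO_LONG"
    )


def with_truncation(payload: dict[str, Any]) -> dict[str, Any]:
    """Copy an extract payload and set overflow_policy to truncate_text."""
    retry = copy.deepcopy(payload)  # leave the caller's payload untouched
    options = retry.setdefault("params", {}).setdefault("options", {})
    options["overflow_policy"] = "truncate_text"
    return retry
```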

GET /v1/models

List all available models with their capabilities.

Response Schema

class ModelsListResponse(BaseModel):
    models: list[ModelInfo]

class ModelInfo(BaseModel):
    name: str                      # Model name
    inputs: list[str]              # Supported inputs: text, image
    outputs: list[str]             # Supported outputs: dense, sparse, multivector, score
    dims: dict[str, int]           # Dimensions per output type
    loaded: bool                   # Whether model is in GPU memory
    max_sequence_length: int       # Maximum tokens
    profiles: dict[str, ProfileInfo]  # Available profiles

class ProfileInfo(BaseModel):
    is_default: bool               # Whether this is the default profile
    output_types: list[str]        # Output types enabled by this profile
    output_similarity: dict[str, str]  # Similarity metrics per output type

Example

curl -H "Accept: application/json" http://localhost:8080/v1/models

Response:

{
  "models": [
    {
      "name": "BAAI/bge-m3",
      "inputs": ["text"],
      "outputs": ["dense", "sparse", "multivector"],
      "dims": {"dense": 1024, "sparse": 250002, "multivector": 1024},
      "loaded": true,
      "max_sequence_length": 8192,
      "profiles": {}
    },
    {
      "name": "BAAI/bge-reranker-v2-m3",
      "inputs": ["text"],
      "outputs": ["score"],
      "dims": {},
      "loaded": false,
      "max_sequence_length": 8192,
      "profiles": {}
    }
  ]
}
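
The listing can be filtered client-side to pick a model for a given task. A minimal sketch over the decoded JSON response; `models_with_output` is a hypothetical helper.

```python
from typing import Any


def models_with_output(
    listing: dict[str, Any], output: str, loaded_only: bool = False
) -> list[str]:
    """Names of models that advertise the given output type."""
    return [
        model["name"]
        for model in listing["models"]
        if output in model["outputs"] and (model["loaded"] or not loaded_only)
    ]
```

With the response above, filtering on `"score"` selects the reranker, and adding `loaded_only=True` excludes it because it is not yet loaded.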

POST /v1/embeddings (OpenAI Compatible)

Drop-in replacement for OpenAI’s embeddings API.

Example

curl -X POST http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{
    "model": "BAAI/bge-m3",
    "input": ["Hello, world!"]
  }'

Response:

{
  "object": "list",
  "model": "BAAI/bge-m3",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [0.0234, -0.0891, ...]
    }
  ],
  "usage": {
    "prompt_tokens": 3,
    "total_tokens": 3
  }
}

Works with OpenAI SDK, LangChain’s OpenAIEmbeddings, and other compatible clients.
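
Without an SDK, the same request and response shapes can be handled with plain dictionaries. The sketch below builds the request body and returns embeddings in input order using the `index` field; both helper names are hypothetical.

```python
from typing import Any


def embeddings_payload(model: str, texts: list[str]) -> dict[str, Any]:
    """Request body in the OpenAI embeddings shape."""
    return {"model": model, "input": list(texts)}


def embeddings_in_order(response: dict[str, Any]) -> list[list[float]]:
    """Embeddings sorted by their `index`, matching the order of `input`."""
    data = sorted(response["data"], key=lambda entry: entry["index"])
    return [entry["embedding"] for entry in data]
```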

Health Endpoints

GET /healthz

Liveness probe. Returns 200 if the server process is running.

curl http://localhost:8080/healthz
# "ok"

GET /readyz

Readiness probe. On standalone sie-server, returns 200 when the Python process is ready to accept traffic. On the gateway, returns 200 when the process can accept requests; it does not wait for SIE server sidecar health or sie-config.

curl http://localhost:8080/readyz
# "ok"
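
Startup scripts often poll `/readyz` before sending traffic. The following is a minimal sketch of such a loop; `http_probe` and `wait_until_ready` are hypothetical names, and the attempt count and interval are arbitrary defaults rather than recommended values.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str, timeout: float = 2.0) -> bool:
    """True when the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and non-2xx answers all count as not ready.
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    attempts: int = 30,
    interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `probe` until it succeeds or the attempts run out."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(interval)  # no need to sleep after the final attempt
    return False
```

A typical call would be `wait_until_ready(lambda: http_probe("http://localhost:8080/readyz"))`.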

GET /metrics

Prometheus metrics endpoint. Standalone sie-server exposes sie_* adapter metrics. Kubernetes deployments also expose sie_gateway_* on the gateway and sie_worker_* / sie_pull_loop_* from the SIE server sidecar inside each worker pod.

Available Metrics

Metric	Type	Labels	Description
`sie_requests_total`	Counter	model, endpoint, status	Total request count
`sie_request_duration_seconds`	Histogram	model, endpoint, phase	Latency by phase
`sie_batch_size`	Histogram	model	Batch size distribution
`sie_tokens_processed_total`	Counter	model	Total tokens processed
`sie_queue_depth`	Gauge	model	Pending items per model
`sie_model_loaded`	Gauge	model, device	Model load status (1/0)
`sie_model_memory_bytes`	Gauge	model, device	GPU memory per model
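
For quick checks without a Prometheus server, the text returned by `/metrics` can be parsed directly. The sketch below is a deliberately small parser for simple `name{labels} value` lines, not a full implementation of the exposition format: it ignores timestamps and does not handle `}` inside label values. The function names are hypothetical.

```python
import re

_SAMPLE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
_LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text: str) -> list[tuple[str, dict[str, str], float]]:
    """Parse metric lines into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE.match(line)
        if match is None:
            continue
        labels = dict(_LABEL.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def queue_depth_by_model(text: str) -> dict[str, float]:
    """Map each model label to its sie_queue_depth value."""
    return {
        labels.get("model", ""): value
        for name, labels, value in parse_samples(text)
        if name == "sie_queue_depth"
    }
```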

WebSocket /ws/status

Real-time Python sie-server status stream. Sends updates every 200ms. In Kubernetes, gateway routing health comes from SIE server sidecar NATS heartbeats; use /ws/cluster-status on the gateway for aggregate cluster status.

Message Schema

{
    "timestamp": float,            # Unix timestamp
    "gpu": str,                    # GPU type (e.g., "l4", "a100-80gb")
    "loaded_models": list[str],    # Currently loaded models
    "server": {
        "version": str,
        "uptime_seconds": int,
        "user": str,
        "working_dir": str,
        "pid": int
    },
    "gpus": [                      # Per-GPU metrics
        {
            "index": int,
            "name": str,
            "gpu_type": str,       # Normalized type (e.g., "l4", "a100-80gb")
            "utilization_percent": float,
            "memory_used_bytes": int,
            "memory_total_bytes": int,
            "memory_threshold_pct": float,
            "temperature_c": int
        }
    ],
    "models": [                    # Per-model status
        {
            "name": str,
            "state": str,          # "loaded", "loading", "unloading", "available"
            "device": str | None,
            "memory_bytes": int,
            "queue_depth": int,
            "queue_pending_items": int,
            "config": {...}        # Model configuration
        }
    ],
    "counters": {...},             # Prometheus counter metrics
    "histograms": {...}            # Prometheus histogram metrics
}

Usage

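const ws = new WebSocket("ws://localhost:8080/ws/status");
ws.onmessage = (event) => {
    // Each message is one JSON status snapshot.
    const status = JSON.parse(event.data);
    // Guard against hosts that report no GPUs.
    const gpu = status.gpus?.[0];
    if (gpu) {
        console.log(`GPU utilization: ${gpu.utilization_percent}%`);
    }
};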
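
Once a WebSocket client has delivered a text frame, the message can be summarized in Python as well. The sketch below assumes each frame is one JSON document matching the schema above; `gpu_memory_percent` and `models_in_state` are hypothetical helpers, and the transport itself is left out.

```python
import json


def gpu_memory_percent(message: str) -> dict[int, float]:
    """Memory use per GPU index, as a percentage of total memory."""
    status = json.loads(message)
    usage = {}
    for gpu in status.get("gpus", []):
        total = gpu["memory_total_bytes"]
        # Avoid dividing by zero if a device reports no memory.
        usage[gpu["index"]] = 100.0 * gpu["memory_used_bytes"] / total if total else 0.0
    return usage


def models_in_state(message: str, state: str) -> list[str]:
    """Names of models whose `state` equals the given value."""
    status = json.loads(message)
    return [model["name"] for model in status.get("models", []) if model["state"] == state]
```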

Error Responses

All endpoints report errors in the same envelope, with the error code and message nested under detail:

{
  "detail": {
    "code": "MODEL_NOT_FOUND",
    "message": "Model 'unknown-model' not found"
  }
}

Error Codes

Code	HTTP Status	Description
`MODEL_NOT_FOUND`	404	Requested model doesn’t exist
`INVALID_INPUT`	400	Invalid request format
`INPUT_TOO_LONG`	400	Input exceeds model context (extract endpoint, gliclass family)
`MODEL_NOT_LOADED`	503	Model is not loaded or still loading
`LORA_LOADING`	503	LoRA adapter is still loading; retry after the delay given in the `Retry-After` header
`QUEUE_FULL`	503	Server overloaded, request queue is full
`DEPENDENCY_CONFLICT`	409	Model requires different bundle/dependencies
`INFERENCE_ERROR`	500	Error during model inference
`INTERNAL_ERROR`	500	Unexpected server error
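
One way a client might act on these codes is to retry only the 503 cases. The sketch below treats `MODEL_NOT_LOADED`, `LORA_LOADING`, and `QUEUE_FULL` as retryable, which is a policy assumption of this example rather than documented guidance. It honors a numeric `Retry-After` value when present and otherwise falls back to capped exponential backoff. `retry_delay` is a hypothetical helper, and `headers` is assumed to be a plain dict with canonical header casing.

```python
from typing import Any, Optional

RETRYABLE_CODES = {"MODEL_NOT_LOADED", "LORA_LOADING", "QUEUE_FULL"}


def retry_delay(
    status: int,
    body: Any,
    headers: dict[str, str],
    attempt: int,
    base: float = 0.5,
    cap: float = 30.0,
) -> Optional[float]:
    """Seconds to wait before retrying, or None if the error is not retryable."""
    detail = body.get("detail") if isinstance(body, dict) else None
    code = detail.get("code") if isinstance(detail, dict) else None
    if status != 503 or code not in RETRYABLE_CODES:
        return None
    retry_after = headers.get("Retry-After")
    if retry_after is not None:
        try:
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # the HTTP-date form of Retry-After is not handled in this sketch
    # Exponential backoff: base, 2*base, 4*base, ... capped at `cap`.
    return min(cap, base * 2 ** attempt)
```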

Response Headers

Timing and tracing information is included in response headers:

Header	Description
`X-Total-Time`	Total request time (ms)
`X-Queue-Time`	Time waiting in queue (ms)
`X-Tokenization-Time`	Preprocessing time (ms)
`X-Inference-Time`	GPU inference time (ms)
`X-Postprocessing-Time`	Postprocessing time (ms), only if > 0
`X-Trace-ID`	OpenTelemetry trace ID for distributed tracing
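
These headers can be collected into a single timing record per request. A minimal sketch, assuming the timing values are plain numeric strings in milliseconds; `timing_from_headers` and the output key names are hypothetical.

```python
from typing import Mapping

TIMING_HEADERS = {
    "X-Total-Time": "total_ms",
    "X-Queue-Time": "queue_ms",
    "X-Tokenization-Time": "tokenization_ms",
    "X-Inference-Time": "inference_ms",
    "X-Postprocessing-Time": "postprocessing_ms",
}


def timing_from_headers(headers: Mapping[str, str]) -> dict[str, float]:
    """Extract the timing headers that are present, as floats in milliseconds."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    timing = {}
    for header, key in TIMING_HEADERS.items():
        raw = lowered.get(header.lower())
        if raw is not None:
            timing[key] = float(raw)
    return timing
```

Headers that are absent, such as `X-Postprocessing-Time` when it is zero, are simply left out of the result.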