inference-server

Here are 106 public repositories matching this topic...

jundot / omlx

LLM inference server with continuous batching & SSD caching for Apple Silicon — managed from the macOS menu bar

macos inference-server mlx apple-silicon openai-api llm

Updated Jun 14, 2026
Python

Michael-A-Kuykendall / shimmy

⚡ Pure-Rust WebGPU inference engine — OpenAI-API compatible, GGUF native, runs on any GPU. No Python. No llama.cpp. Single binary.

Updated Jun 11, 2026
Rust

RamaLama is an open-source developer tool that simplifies the local serving of AI models from any source and facilitates their use for inference in production, all through the familiar language of containers.

ai containers cuda intel hip hacktoberfest inference-server podman llm llamacpp vllm

Updated Jun 12, 2026
Python

roboflow / inference

Turn any computer or edge device into a command center for your computer vision projects.

Updated Jun 12, 2026
Python

superlinked / sie

Open-source inference server and production cluster for all the models your agent needs.

Updated Jun 13, 2026
Python

basetenlabs / truss

The simplest way to serve AI/ML models in production

open-source machine-learning packaging artificial-intelligence falcon easy-to-use whisper inference-server model-serving inference-api stable-diffusion wizardlm

Updated Jun 13, 2026
Python

pipeless-ai / pipeless

An open-source computer vision framework to build and deploy apps in minutes

Updated May 8, 2024
Rust

underneathall / pinferencia

Python + Inference - Model Deployment library in Python. Simplest model inference server ever.

Updated Feb 14, 2023
Python

NVIDIA / gpu-rest-engine

A REST API for Caffe using Docker and Go

docker caffe deep-learning gpu inference inference-server

Updated Jul 20, 2018
C++

aiptimizer / TurboOCR

Fast GPU OCR server. 270 img/s on FUNSD. TensorRT FP16, PP-OCRv5, HTTP + gRPC.

ocr grpc nvidia text-recognition text-detection inference-server fp16 tensorrt rag fastapi pdf-extraction paddleocr easyocr document-ai document-parsing qwen-vl gpu-ocr

Updated Jun 11, 2026
C++

Epistates / pmetal

PMetal: high-performance Apple Silicon framework for local LLM inference, LoRA/QLoRA fine-tuning, serving, quantization, and MLX/Metal acceleration.

Updated Jun 5, 2026
Rust

containers / podman-desktop-extension-ai-lab

Work with LLMs on a local environment using containers

ai local containers inference-server podman llms

Updated Jun 10, 2026
TypeScript

BMW-InnovationLab / BMW-YOLOv4-Inference-API-GPU

This is a repository for a no-code object detection inference API using the YOLOv3 and YOLOv4 Darknet framework.

Updated Jun 28, 2022
Python

raketenkater / llm-server

Auto-tuned launcher for GGUF models on llama.cpp / ik_llama.cpp — OpenAI-compatible server with multi-GPU tensor-split, MoE expert placement, measured flag tuning (AI Tune), hardware-matched HuggingFace downloads, and crash recovery. An Ollama alternative for multi-GPU rigs.