OMLX | LLM inference, optimized for your Mac


OMLX

Introduction

oMLX is a high-performance, open-source inference server engineered specifically for Apple Silicon Macs. It builds on Apple's MLX framework to provide native, high-speed execution of Large Language Models (LLMs) and Vision-Language Models (VLMs). Its standout feature is a two-tier 'Paged SSD KV Cache', which persists KV cache blocks from unified memory to the SSD. This avoids the 'context re-computation' delay common in tools like Ollama, making it well suited to running local coding agents (such as Claude Code or Cursor) on a Mac without losing speed as conversations grow long.
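The two-tier idea can be illustrated with a minimal sketch: hot KV blocks stay in RAM, least-recently-used blocks spill to disk, and a later request for a spilled block restores it instead of recomputing the context. This is an illustrative toy, not oMLX's actual implementation (which manages MLX tensors, not pickled Python objects):

```python
import pickle
import tempfile
from collections import OrderedDict
from pathlib import Path

class TwoTierKVCache:
    """Toy two-tier cache: hot blocks live in RAM; on eviction, blocks are
    serialized to disk and can be restored on demand, so a reused context
    is reloaded rather than re-computed."""

    def __init__(self, spill_dir: Path, max_ram_blocks: int = 2):
        self.ram = OrderedDict()          # block_id -> KV block data
        self.spill_dir = spill_dir
        self.max_ram_blocks = max_ram_blocks

    def put(self, block_id: str, kv_block) -> None:
        self.ram[block_id] = kv_block
        self.ram.move_to_end(block_id)
        # Evict least-recently-used blocks to the disk tier.
        while len(self.ram) > self.max_ram_blocks:
            victim_id, victim = self.ram.popitem(last=False)
            (self.spill_dir / f"{victim_id}.kv").write_bytes(pickle.dumps(victim))

    def get(self, block_id: str):
        if block_id in self.ram:
            self.ram.move_to_end(block_id)
            return self.ram[block_id]
        spilled = self.spill_dir / f"{block_id}.kv"
        if spilled.exists():              # restore from disk instead of recomputing
            kv_block = pickle.loads(spilled.read_bytes())
            self.put(block_id, kv_block)
            return kv_block
        return None                       # true miss: caller must recompute
```

Because the disk tier survives process restarts, a restarted server can pick up old conversation blocks the same way, which is what keeps TTFT low after a restart.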

Use Cases

  • Local Coding Agent Backend
    Use as a drop-in replacement for Anthropic or OpenAI APIs to power Claude Code, Cursor, or OpenClaw. It eliminates the 30–90s ‘waiting for response’ lag by restoring past context from SSD in milliseconds.
  • High-Speed VLM & OCR Processing
    Run vision-based tasks like analyzing UI screenshots or extracting text from complex documents using optimized models like DeepSeek-OCR or Qwen-VL.
  • All-in-One Personal RAG Stack
    Serve an LLM, an Embedding model, and a Reranker simultaneously to build a fully local, private ‘Second Brain’ for your personal notes and files.
  • Concurrent Multi-Request Workflows
    Utilize ‘Continuous Batching’ to handle multiple parallel AI requests from different apps or agents without them queuing behind one another.
  • Hardware Benchmarking
    Use the built-in benchmarking suite to measure the exact tokens-per-second and prefill performance of your specific Mac chip (M1 through M5).
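Since the server exposes an OpenAI-compatible chat endpoint, any HTTP client can drive it. A minimal sketch using only the standard library (the base URL, port, and model name here are assumptions; substitute whatever your local setup uses):

```python
import json
import urllib.request

# Assumed base URL: check oMLX's docs for its actual default host/port.
OMLX_BASE_URL = "http://localhost:8080"

def build_chat_request(model: str, prompt: str) -> dict:
    """Build an OpenAI-style /v1/chat/completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }

def chat(model: str, prompt: str) -> str:
    """POST the payload to the local server and return the reply text."""
    req = urllib.request.Request(
        f"{OMLX_BASE_URL}/v1/chat/completions",
        data=json.dumps(build_chat_request(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

`build_chat_request` is separated out so the payload shape is visible on its own; `chat()` performs the actual POST against a running server.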

Features & Benefits

  • Two-Tier SSD KV Caching
    Persists conversation memory to disk. Even after a server restart, past context is instantly recovered rather than re-calculated, dropping TTFT (Time to First Token) from minutes to seconds.
  • Native macOS Menu Bar App
    A lightweight, notarized macOS app (not Electron) for monitoring server status, memory pressure, and loading/unloading models with one click.
  • OpenAI & Anthropic API Parity
    Provides native `/v1/messages` and `/v1/chat/completions` endpoints, making it a true drop-in backend for virtually any AI tool.
  • Admin Dashboard & Built-in Chat
    A web interface for managing models, adjusting per-model TTL (Time-to-Live), pinning frequently used models, plus a built-in 'Model Arena' for side-by-side testing.
  • Model Auto-Detection & Management
    Automatically discovers and categorizes models (LLM, VLM, Embeddings) in your local directories and includes a built-in HuggingFace downloader.
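The auto-detection idea can be sketched with simple name-based heuristics. These rules are purely illustrative (oMLX's real detector presumably inspects model configuration files rather than folder names):

```python
from pathlib import Path

def categorize_model(name: str) -> str:
    """Illustrative, name-based categorization into llm / vlm / embedding."""
    lowered = name.lower()
    if "embed" in lowered:
        return "embedding"
    if any(tag in lowered for tag in ("vl", "vision", "ocr")):
        return "vlm"
    return "llm"

def scan_models(model_dir: Path) -> dict:
    """Map each model subdirectory to a detected category."""
    return {p.name: categorize_model(p.name)
            for p in model_dir.iterdir() if p.is_dir()}
```

A registry built this way lets the server route requests to the right runtime (text generation, vision, or embeddings) without manual configuration.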

Pros

  • Performance Superiority on Mac
    Project benchmarks report up to 4x faster prefill and significantly higher throughput compared to llama.cpp-based tools like Ollama or LM Studio.
  • Optimized for M5 Architecture
    Fully supports the latest Apple Silicon features, including hardware-specific optimizations for the M5 Max and M5 Ultra.
  • Privacy & Offline Sovereignty
    100% open-source (Apache-2.0) and designed for fully offline operation; all dependencies are vendored, and no telemetry is collected.

Cons

  • macOS Only
    Specifically tied to the Apple Silicon/MLX ecosystem; not available for Windows, Linux, or Intel-based Macs.
  • Memory Intensive
    While the server itself is efficient, comfortably running large models (30B+) still calls for 64GB+ of unified memory.
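The 64GB guidance follows from simple arithmetic: quantized weights alone occupy roughly params × bits ÷ 8 bytes, before adding the KV cache, any co-loaded embedding/reranker models, and OS headroom. A back-of-the-envelope estimate:

```python
def estimate_weights_gb(params_billion: float, bits_per_weight: int = 4) -> float:
    """Rough weight footprint in GB (decimal): params * bits / 8 bytes."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# A 30B model at 4-bit quantization needs ~15 GB for weights alone;
# at 8-bit it doubles to ~30 GB, which is why 64GB of unified memory
# leaves comfortable room for KV cache and the rest of the system.
```

This is a rule of thumb, not a measurement; actual usage depends on the quantization scheme, context length, and how many models are resident at once.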

