OMLX | LLM inference, optimized for your Mac


OMLX

Introduction

oMLX is a high-performance, open-source inference server engineered specifically for Apple Silicon Macs. It builds on Apple's MLX framework to provide native, high-speed execution of Large Language Models (LLMs) and Vision-Language Models (VLMs). Its standout feature is a two-tier 'Paged SSD KV Cache', which persists KV cache blocks from unified memory to the SSD. This avoids the 'context re-computation' delay common in tools like Ollama, making it well suited to running local coding agents (such as Claude Code or Cursor) on a Mac without losing speed as conversations grow long.
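The two-tier idea can be illustrated with a minimal sketch: hot KV blocks stay in RAM, least-recently-used blocks spill to disk, and a later request for a spilled block restores it instead of recomputing the context. This is an illustrative toy, not oMLX's actual implementation (which manages MLX tensors, not pickled Python objects):

```python
import pickle
import tempfile
from collections import OrderedDict
from pathlib import Path

class TwoTierKVCache:
    """Toy two-tier cache: hot blocks live in RAM; on eviction, blocks are
    serialized to disk and can be restored on demand, so a reused context
    is reloaded rather than re-computed."""

    def __init__(self, spill_dir: Path, max_ram_blocks: int = 2):
        self.ram = OrderedDict()          # block_id -> KV block data
        self.spill_dir = spill_dir
        self.max_ram_blocks = max_ram_blocks

    def put(self, block_id: str, kv_block) -> None:
        self.ram[block_id] = kv_block
        self.ram.move_to_end(block_id)
        # Evict least-recently-used blocks to the disk tier.
        while len(self.ram) > self.max_ram_blocks:
            victim_id, victim = self.ram.popitem(last=False)
            (self.spill_dir / f"{victim_id}.kv").write_bytes(pickle.dumps(victim))

    def get(self, block_id: str):
        if block_id in self.ram:
            self.ram.move_to_end(block_id)
            return self.ram[block_id]
        spilled = self.spill_dir / f"{block_id}.kv"
        if spilled.exists():              # restore from disk instead of recomputing
            kv_block = pickle.loads(spilled.read_bytes())
            self.put(block_id, kv_block)
            return kv_block
        return None                       # true miss: caller must recompute
```

Because the disk tier survives process restarts, a restarted server can pick up old conversation blocks the same way, which is what keeps TTFT low after a restart.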

Use Cases

  • Local Coding Agent Backend
    Use as a drop-in replacement for Anthropic or OpenAI APIs to power Claude Code, Cursor, or OpenClaw. It eliminates the 30–90s ‘waiting for response’ lag by restoring past context from SSD in milliseconds.
  • High-Speed VLM & OCR Processing
    Run vision-based tasks like analyzing UI screenshots or extracting text from complex documents using optimized models like DeepSeek-OCR or Qwen-VL.
  • All-in-One Personal RAG Stack
    Serve an LLM, an Embedding model, and a Reranker simultaneously to build a fully local, private ‘Second Brain’ for your personal notes and files.
  • Concurrent Multi-Request Workflows
    Utilize ‘Continuous Batching’ to handle multiple parallel AI requests from different apps or agents without them queuing behind one another.
  • Hardware Benchmarking
    Use the built-in benchmarking suite to measure the exact tokens-per-second and prefill performance of your specific Mac chip (M1 through M5).
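Since the server exposes an OpenAI-compatible chat endpoint, any HTTP client can drive it. A minimal sketch using only the standard library (the base URL, port, and model name here are assumptions; substitute whatever your local setup uses):

```python
import json
import urllib.request

# Assumed base URL: check oMLX's docs for its actual default host/port.
OMLX_BASE_URL = "http://localhost:8080"

def build_chat_request(model: str, prompt: str) -> dict:
    """Build an OpenAI-style /v1/chat/completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }

def chat(model: str, prompt: str) -> str:
    """POST the payload to the local server and return the reply text."""
    req = urllib.request.Request(
        f"{OMLX_BASE_URL}/v1/chat/completions",
        data=json.dumps(build_chat_request(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

`build_chat_request` is separated out so the payload shape is visible on its own; `chat()` performs the actual POST against a running server.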

Features & Benefits

  • Two-Tier SSD KV Caching
    Persists conversation memory to disk. Even after a server restart, past context is instantly recovered rather than re-calculated, dropping TTFT (Time to First Token) from minutes to seconds.
  • Native macOS Menu Bar App
    A lightweight, notarized macOS app (not Electron) for monitoring server status, memory pressure, and loading/unloading models with one click.
  • OpenAI & Anthropic API Parity
    Provides native `/v1/messages` and `/v1/chat/completions` endpoints, making it a true drop-in backend for virtually any AI tool.
  • Admin Dashboard & Built-in Chat
    A web interface for managing models, adjusting per-model TTL (Time-to-Live), pinning frequently used models, plus a built-in 'Model Arena' for side-by-side testing.
  • Model Auto-Detection & Management
    Automatically discovers and categorizes models (LLM, VLM, Embeddings) in your local directories and includes a built-in HuggingFace downloader.
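The auto-detection idea can be sketched with simple name-based heuristics. These rules are purely illustrative (oMLX's real detector presumably inspects model configuration files rather than folder names):

```python
from pathlib import Path

def categorize_model(name: str) -> str:
    """Illustrative, name-based categorization into llm / vlm / embedding."""
    lowered = name.lower()
    if "embed" in lowered:
        return "embedding"
    if any(tag in lowered for tag in ("vl", "vision", "ocr")):
        return "vlm"
    return "llm"

def scan_models(model_dir: Path) -> dict:
    """Map each model subdirectory to a detected category."""
    return {p.name: categorize_model(p.name)
            for p in model_dir.iterdir() if p.is_dir()}
```

A registry built this way lets the server route requests to the right runtime (text generation, vision, or embeddings) without manual configuration.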

Pros

  • Performance Superiority on Mac
    Project benchmarks report up to 4x faster prefill and significantly higher throughput compared to llama.cpp-based tools like Ollama or LM Studio.
  • Optimized for M5 Architecture
    Fully supports the latest Apple Silicon features, including hardware-specific optimizations for the M5 Max and M5 Ultra.
  • Privacy & Offline Sovereignty
    100% open-source (Apache-2.0) and designed for fully offline operation; all dependencies are vendored, and no telemetry is collected.

Cons

  • macOS Only
    Specifically tied to the Apple Silicon/MLX ecosystem; not available for Windows, Linux, or Intel-based Macs.
  • Memory Intensive
    While the server itself is efficient, comfortably running large models (30B+) still calls for 64GB+ of unified memory.
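The 64GB guidance follows from simple arithmetic: quantized weights alone occupy roughly params × bits ÷ 8 bytes, before adding the KV cache, any co-loaded embedding/reranker models, and OS headroom. A back-of-the-envelope estimate:

```python
def estimate_weights_gb(params_billion: float, bits_per_weight: int = 4) -> float:
    """Rough weight footprint in GB (decimal): params * bits / 8 bytes."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# A 30B model at 4-bit quantization needs ~15 GB for weights alone;
# at 8-bit it doubles to ~30 GB, which is why 64GB of unified memory
# leaves comfortable room for KV cache and the rest of the system.
```

This is a rule of thumb, not a measurement; actual usage depends on the quantization scheme, context length, and how many models are resident at once.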

