Shimmy Download

Tiny Rust OpenAI-compatible server for local GGUF/SafeTensors models with hot swaps, auto-discovery, and multi-backend GPU/MOE support for drop-in use across editors and SDKs.

⭐ 3,449 stars on GitHub
Latest Release: v1.8.1

About Software

Shimmy is a 5MB Rust inference server that mirrors the OpenAI API for local GGUF/SafeTensors models. It auto-discovers models from the Hugging Face cache, Ollama, or local directories, hot-swaps them, and allocates ports automatically.

It supports CUDA, Vulkan, OpenCL, MLX, and MOE hybrid offloading to fit larger models on constrained GPUs. Editors and SDKs work by simply repointing the base URL; no API keys are required for local use.

Use Cases:

  • Run OpenAI-compatible APIs locally with a tiny Rust single binary
  • Serve GGUF/SafeTensors models with hot swaps and auto-discovery
  • Expose drop-in endpoints for SDKs, VSCode, Cursor, and Continue.dev
  • Leverage CUDA/Vulkan/OpenCL/MLX acceleration or MOE hybrid offload on limited VRAM
  • Zero-config startup with auto ports and LoRA detection for local inference

Downloads

v1.8.1 (December 08, 2025): shimmy-windows-x86_64.exe, shimmy.exe
v1.8.0 (December 08, 2025): shimmy-windows-x86_64.exe, shimmy.exe
v1.7.4 (October 23, 2025): shimmy-windows-x86_64.exe, shimmy.exe
v1.7.3 (October 12, 2025): shimmy-windows-x86_64.exe, shimmy.exe
v1.7.2-test6 (October 10, 2025): shimmy-windows-x86_64.exe, shimmy.exe
v1.7.2 (October 10, 2025): shimmy-windows-x86_64.exe, shimmy.exe
v1.7.0 (October 08, 2025): shimmy-windows-x86_64.exe, shimmy.exe
v1.6.0 (October 03, 2025): shimmy-windows-x86_64.exe, shimmy.exe
v1.5.6 (September 23, 2025): shimmy.exe
v1.5.1 (September 19, 2025): shimmy-windows-x86_64.exe, shimmy.exe
v1.5.0 (September 19, 2025): shimmy-windows-x86_64.exe, shimmy.exe
v1.4.1 (September 17, 2025): shimmy-windows-x86_64.exe, shimmy.exe
v1.4.0 (September 17, 2025): shimmy.exe
v1.3.3 (September 14, 2025): shimmy-windows-x86_64.exe, shimmy.exe
v1.3.1 (September 12, 2025): shimmy-windows-x86_64-v1.3.1.exe
v1.2.0 (September 10, 2025): shimmy-windows-x86_64.exe, shimmy.exe
v1.0.0 (September 08, 2025): shimmy-windows-x86_64.exe, shimmy.exe
v0.1.1 (September 06, 2025): shimmy.exe
v0.1.0 (September 04, 2025): shimmy-windows-amd64.exe, shimmy.exe

Package Info

Last Updated: Dec 08, 2025
Latest Version: v1.8.1
License: MIT
Total Versions: 19

README

The Lightweight OpenAI API Server

🔒 Local Inference Without Dependencies 🚀



Shimmy will be free forever. No asterisks. No "free for now." No pivot to paid.

💝 Support Shimmy's Growth

🚀 If Shimmy helps you, consider sponsoring (https://github.com/sponsors/Michael-A-Kuykendall) — 100% of support goes to keeping it free forever.

  • $5/month: Coffee tier ☕ - Eternal gratitude + sponsor badge
  • $25/month: Bug prioritizer 🐛 - Priority support + name in SPONSORS.md
  • $100/month: Corporate backer 🏢 - Logo placement + monthly office hours
  • $500/month: Infrastructure partner 🚀 - Direct support + roadmap input

🎯 Become a Sponsor (https://github.com/sponsors/Michael-A-Kuykendall) | See our amazing sponsors 🙏


Drop-in OpenAI API Replacement for Local LLMs

Shimmy is a 4.8MB single binary that provides 100% OpenAI-compatible endpoints for GGUF models. Point your existing AI tools at Shimmy and they just work: locally, privately, and free.

Developer Tools

Whether you're forking Shimmy or integrating it as a service, we provide complete documentation and integration templates.

Try it in 30 seconds

# 1) Install + run
cargo install shimmy --features huggingface
shimmy serve &

# 2) See models and pick one
shimmy list

# 3) Smoke test the OpenAI API
curl -s http://127.0.0.1:11435/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model":"REPLACE_WITH_MODEL_FROM_list",
        "messages":[{"role":"user","content":"Say hi in 5 words."}],
        "max_tokens":32
      }' | jq -r '.choices[0].message.content'

🚀 Compatible with OpenAI SDKs and Tools

No code changes needed: just change the API endpoint (or set the standard environment variables, as shown after this list):

  • Any OpenAI client: Python, Node.js, curl, etc.
  • Development applications: Compatible with standard SDKs
  • VSCode Extensions: Point to http://localhost:11435
  • Cursor Editor: Built-in OpenAI compatibility
  • Continue.dev: Drop-in model provider
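
Recent versions of the official OpenAI SDKs also read the base URL and key from environment variables, so many tools can be repointed without any code change. A minimal sketch; the port shown is the example address used elsewhere in this README, so substitute whatever port shimmy serve actually prints:

# Point OpenAI-compatible tools at Shimmy via environment variables.
# Recent official OpenAI SDKs (Python >=1.0, Node v4) read these on startup.
export OPENAI_BASE_URL="http://127.0.0.1:11435/v1"
export OPENAI_API_KEY="sk-local"   # placeholder; Shimmy ignores the key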

Use with OpenAI SDKs

  • Node.js (openai v4)

import OpenAI from "openai";

const openai = new OpenAI({
  baseURL: "http://127.0.0.1:11435/v1",
  apiKey: "sk-local", // placeholder, Shimmy ignores it
});

const resp = await openai.chat.completions.create({
  model: "REPLACE_WITH_MODEL",
  messages: [{ role: "user", content: "Say hi in 5 words." }],
  max_tokens: 32,
});

console.log(resp.choices[0].message?.content);

  • Python (openai>=1.0.0)

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:11435/v1", api_key="sk-local")

resp = client.chat.completions.create(
    model="REPLACE_WITH_MODEL",
    messages=[{"role": "user", "content": "Say hi in 5 words."}],
    max_tokens=32,
)

print(resp.choices[0].message.content)

⚡ Zero Configuration Required

  • Automatically finds models from Hugging Face cache, Ollama, local dirs
  • Auto-allocates ports to avoid conflicts
  • Auto-detects LoRA adapters for specialized models
  • Just works - no config files, no setup wizards

🧠 Advanced MOE (Mixture of Experts) Support

Run 70B+ models on consumer hardware with intelligent CPU/GPU hybrid processing:

  • 🔄 CPU MOE Offloading: Automatically distribute model layers across CPU and GPU
  • 🧮 Intelligent Layer Placement: Optimizes which layers run where for maximum performance
  • 💾 Memory Efficiency: Fit larger models in limited VRAM by using system RAM strategically
  • ⚡ Hybrid Acceleration: Get GPU speed where it matters most, CPU reliability everywhere else
  • 🎛️ Configurable: --cpu-moe and --n-cpu-moe flags for fine control

# Enable MOE CPU offloading during installation
cargo install shimmy --features moe

# Run with MOE hybrid processing
shimmy serve --cpu-moe --n-cpu-moe 8

# Automatically balances: GPU layers (fast) + CPU layers (memory-efficient)

Perfect for: Large models (70B+), limited VRAM systems, cost-effective inference

🎯 Perfect for Local Development

  • Privacy: Your code never leaves your machine
  • Cost: No API keys, no per-token billing
  • Speed: Local inference, sub-second responses
  • Reliability: No rate limits, no downtime

Quick Start (30 seconds)

Installation

🪟 Windows

# RECOMMENDED: Use pre-built binary (no build dependencies required)
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy.exe -o shimmy.exe

# OR: Install from source with MOE support
# First install build dependencies:
winget install LLVM.LLVM
# Then install shimmy with MOE:
cargo install shimmy --features moe

# For CUDA + MOE hybrid processing:
cargo install shimmy --features llama-cuda,moe

⚠️ Windows Notes:

  • The pre-built binary is recommended to avoid build dependency issues (see the run example below)
  • MSVC compatibility: Uses shimmy-llama-cpp-2 packages for better Windows support
  • If Windows Defender flags the binary, add an exclusion or use cargo install
  • For cargo install: Install LLVM (https://releases.llvm.org/download.html) first to resolve libclang.dll errors
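
Once downloaded, the pre-built binary runs directly without an install step; a minimal sketch, assuming shimmy.exe was saved to the current directory:

# Run the downloaded binary directly
./shimmy.exe serve

# Or, after cargo install, simply:
shimmy serve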

🍎 macOS / 🐧 Linux

# Install from crates.io
cargo install shimmy --features huggingface

GPU Acceleration

Shimmy supports multiple GPU backends for accelerated inference:

🖥️ Available Backends

Backend      | Hardware             | Installation
CUDA         | NVIDIA GPUs          | cargo install shimmy --features llama-cuda
CUDA + MOE   | NVIDIA GPUs + CPU    | cargo install shimmy --features llama-cuda,moe
Vulkan       | Cross-platform GPUs  | cargo install shimmy --features llama-vulkan
OpenCL       | AMD/Intel/Others     | cargo install shimmy --features llama-opencl
MLX          | Apple Silicon        | cargo install shimmy --features mlx
MOE Hybrid   | Any GPU + CPU        | cargo install shimmy --features moe
All Features | Everything           | cargo install shimmy --features gpu,moe

🔍 Check GPU Support

# Show detected GPU backends
shimmy gpu-info

⚡ Usage Notes

  • GPU backends are automatically detected at runtime
  • Falls back to CPU if GPU is unavailable
  • Multiple backends can be compiled in; the best one is selected automatically
  • Use --gpu-backend to force a specific backend (see the example below)
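
For example, to pin a backend instead of relying on auto-selection (the backend name below is illustrative; run shimmy gpu-info to see what your build actually reports):

# Force a specific GPU backend rather than auto-selecting one
shimmy serve --gpu-backend vulkan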

Get Models

Shimmy auto-discovers models from:

  • Hugging Face cache: ~/.cache/huggingface/hub/
  • Ollama models: ~/.ollama/models/
  • Local directory: ./models/
  • Environment override: SHIMMY_BASE_GGUF=path/to/model.gguf (see the example below)

# Download models that work out of the box
huggingface-cli download microsoft/Phi-3-mini-4k-instruct-gguf --local-dir ./models/
huggingface-cli download bartowski/Llama-3.2-1B-Instruct-GGUF --local-dir ./models/
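
The SHIMMY_BASE_GGUF override listed above is useful when a model lives outside the auto-discovered locations; a minimal sketch, with an example file path:

# Point Shimmy at a single GGUF file outside the auto-discovered paths
SHIMMY_BASE_GGUF=./models/Phi-3-mini-4k-instruct-q4.gguf shimmy serve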

Start Server

# Auto-allocates port to avoid conflicts
shimmy serve

# Or use manual port
shimmy serve --bind 127.0.0.1:11435

Point your development tools to the displayed port — VSCode Copilot, Cursor, Continue.dev all work instantly.
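
To confirm the server is reachable before wiring up an editor, you can query the model list. This assumes Shimmy exposes the standard OpenAI model-listing route, consistent with its OpenAI-compatible API; substitute the port shimmy prints:

# Quick reachability check against the OpenAI-compatible API
curl -s http://127.0.0.1:11435/v1/models | jq .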

📦 Download & Install