Shimmy is a 5MB Rust inference server that mirrors OpenAI APIs for local GGUF/SafeTensors models. It auto-discovers models from Hugging Face, Ollama, or local dirs, hot-swaps them, and allocates ports automatically.
It supports CUDA, Vulkan, OpenCL, MLX, and MOE hybrid offloading to fit larger models on constrained GPUs. Editors and SDKs work by just repointing the base URL, with no API keys required for local use.
![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg) ![Security](https://img.shields.io/badge/Security-Audited-green) ![Crates.io](https://img.shields.io/crates/v/shimmy.svg) ![Downloads](https://img.shields.io/crates/d/shimmy.svg) ![Rust](https://img.shields.io/badge/rust-stable-brightgreen.svg) ![GitHub Stars](https://img.shields.io/github/stars/Michael-A-Kuykendall/shimmy?style=social)
[![💝 Sponsor this project](https://img.shields.io/badge/💝_Sponsor_this_project-ea4aaa?style=for-the-badge&logo=github&logoColor=white)](https://github.com/sponsors/Michael-A-Kuykendall)
Shimmy will be free forever. No asterisks. No "free for now." No pivot to paid.
🚀 If Shimmy helps you, consider [sponsoring](https://github.com/sponsors/Michael-A-Kuykendall) — 100% of support goes to keeping it free forever.
🎯 [Become a Sponsor](https://github.com/sponsors/Michael-A-Kuykendall) | See our amazing sponsors 🙏
Shimmy is a 4.8MB single binary that provides 100% OpenAI-compatible endpoints for GGUF models. Point your existing AI tools at Shimmy and they just work — locally, privately, and free.
Whether you're forking Shimmy or integrating it as a service, we provide complete documentation and integration templates.
```bash
# 1) Install + run
cargo install shimmy --features huggingface
shimmy serve &

# 2) See models and pick one
shimmy list

# 3) Smoke test the OpenAI API
curl -s http://127.0.0.1:11435/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model":"REPLACE_WITH_MODEL_FROM_list",
    "messages":[{"role":"user","content":"Say hi in 5 words."}],
    "max_tokens":32
  }' | jq -r '.choices[0].message.content'
```
No code changes needed - just change the API endpoint to `http://localhost:11435`:

```typescript
import OpenAI from "openai";

const openai = new OpenAI({
  baseURL: "http://127.0.0.1:11435/v1",
  apiKey: "sk-local", // placeholder, Shimmy ignores it
});

const resp = await openai.chat.completions.create({
  model: "REPLACE_WITH_MODEL",
  messages: [{ role: "user", content: "Say hi in 5 words." }],
  max_tokens: 32,
});

console.log(resp.choices[0].message?.content);
```
```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:11435/v1", api_key="sk-local")

resp = client.chat.completions.create(
    model="REPLACE_WITH_MODEL",
    messages=[{"role": "user", "content": "Say hi in 5 words."}],
    max_tokens=32,
)
print(resp.choices[0].message.content)
```
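Streaming works through the same SDK call. A minimal sketch, assuming Shimmy's `/v1/chat/completions` honors the standard `stream: true` server-sent-events behavior (the model name is a placeholder; use one reported by `shimmy list`):

```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:11435/v1", api_key="sk-local")

# Request a streamed completion; chunks arrive as tokens are generated.
stream = client.chat.completions.create(
    model="REPLACE_WITH_MODEL",
    messages=[{"role": "user", "content": "Count to five."}],
    max_tokens=32,
    stream=True,
)

for chunk in stream:
    # Each chunk carries an incremental delta; content can be None on role/stop chunks.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```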
Run 70B+ models on consumer hardware with intelligent CPU/GPU hybrid processing. Use the `--cpu-moe` and `--n-cpu-moe` flags for fine control:

```bash
# Enable MOE CPU offloading during installation
cargo install shimmy --features moe

# Run with MOE hybrid processing
shimmy serve --cpu-moe --n-cpu-moe 8

# Automatically balances: GPU layers (fast) + CPU layers (memory-efficient)
```
Perfect for: Large models (70B+), limited VRAM systems, cost-effective inference
```bash
# RECOMMENDED: Use pre-built binary (no build dependencies required)
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy.exe -o shimmy.exe

# OR: Install from source with MOE support
# First install build dependencies:
winget install LLVM.LLVM

# Then install shimmy with MOE:
cargo install shimmy --features moe

# For CUDA + MOE hybrid processing:
cargo install shimmy --features llama-cuda,moe
```
⚠️ Windows Notes:
- Pre-built binary recommended to avoid build dependency issues
- MSVC compatibility: uses `shimmy-llama-cpp-2` packages for better Windows support
- If Windows Defender flags the binary, add an exclusion or use `cargo install`
- For `cargo install`: install [LLVM](https://releases.llvm.org/download.html) first to resolve `libclang.dll` errors
```bash
# Install from crates.io
cargo install shimmy --features huggingface
```
Shimmy supports multiple GPU backends for accelerated inference:
| Backend | Hardware | Installation |
|---|---|---|
| CUDA | NVIDIA GPUs | `cargo install shimmy --features llama-cuda` |
| CUDA + MOE | NVIDIA GPUs + CPU | `cargo install shimmy --features llama-cuda,moe` |
| Vulkan | Cross-platform GPUs | `cargo install shimmy --features llama-vulkan` |
| OpenCL | AMD/Intel/Others | `cargo install shimmy --features llama-opencl` |
| MLX | Apple Silicon | `cargo install shimmy --features mlx` |
| MOE Hybrid | Any GPU + CPU | `cargo install shimmy --features moe` |
| All Features | Everything | `cargo install shimmy --features gpu,moe` |
```bash
# Show detected GPU backends
shimmy gpu-info
```

Use `--gpu-backend` to force a specific backend.

Shimmy auto-discovers models from:
- `~/.cache/huggingface/hub/`
- `~/.ollama/models/`
- `./models/`
- `SHIMMY_BASE_GGUF=path/to/model.gguf`

```bash
# Download models that work out of the box
huggingface-cli download microsoft/Phi-3-mini-4k-instruct-gguf --local-dir ./models/
huggingface-cli download bartowski/Llama-3.2-1B-Instruct-GGUF --local-dir ./models/
```
```bash
# Auto-allocates port to avoid conflicts
shimmy serve

# Or use manual port
shimmy serve --bind 127.0.0.1:11435
```
Point your development tools to the displayed port — VSCode Copilot, Cursor, Continue.dev all work instantly.
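If you'd rather not hard-code a model name, you can ask the server what it discovered and use the first entry. A minimal sketch, assuming Shimmy exposes the OpenAI-style `/v1/models` listing and is bound to the manual port above (adjust the base URL if you let it auto-allocate):

```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:11435/v1", api_key="sk-local")

# List whatever models Shimmy auto-discovered and pick the first one.
models = client.models.list()
model_id = models.data[0].id
print("Using model:", model_id)

resp = client.chat.completions.create(
    model=model_id,
    messages=[{"role": "user", "content": "Say hi in 5 words."}],
    max_tokens=32,
)
print(resp.choices[0].message.content)
```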
- Peer-to-peer, end-to-end encrypted file and folder transfer over QUIC with resumable downloads, no accounts, and cross-platform desktop builds.
- Context-aware Windows overlay assistant that reads your screen and delivers translations, summaries, and answers via multi-LLM backends with a sleek keyboard-driven UI.
- WebRTC P2P tool for files, text, and desktop sharing with end-to-end encryption, ACK reliability, Docker/single-binary deploys, and a responsive Next.js UI.
- Ultra-lightweight Minecraft server for embedded and low-RAM systems, trading vanilla completeness for performance with configurable globals and cross-platform polyglot binaries.
- Vulkan layer that brings Lossless Scaling frame generation to Linux/Steam Deck, with a GUI configurator, benchmarks, and per-game tuning.
- Windows 10/11 debloat and optimization suite that manages apps, privacy, performance, and UI tweaks, plus ISO/autounattend creation and reusable config exports.