Download | Documentation | Discord
Lemonade helps users run local LLMs with the highest performance by configuring state-of-the-art inference engines for their NPUs and GPUs.
Startups such as [Styrk AI](https://styrk.ai/styrk-ai-and-amd-guardrails-for-your-on-device-ai-revolution/), research teams like [Hazy Research at Stanford](https://www.amd.com/en/developer/resources/technical-articles/2025/minions--on-device-and-cloud-language-model-collaboration-on-ryz.html), and large companies like [AMD](https://www.amd.com/en/developer/resources/technical-articles/unlocking-a-wave-of-llm-apps-on-ryzen-ai-through-lemonade-server.html) use Lemonade to run LLMs.
Want your app featured here? Discord · GitHub Issue · Email
To run and chat with Gemma 3:

```bash
lemonade-server run Gemma-3-4b-it-GGUF
```
To install models ahead of time, use the `pull` command:

```bash
lemonade-server pull Gemma-3-4b-it-GGUF
```
To see all available models, use the `list` command:

```bash
lemonade-server list
```
Tip: You can use `--llamacpp vulkan/rocm` to select a backend when running GGUF models.
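For example, to run a GGUF model on the Vulkan backend (a minimal sketch; the flag comes straight from the tip above):

```bash
# Select the Vulkan llama.cpp backend for this GGUF model
lemonade-server run Gemma-3-4b-it-GGUF --llamacpp vulkan
```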
Lemonade supports GGUF, FLM, and ONNX models across CPU, GPU, and NPU (see supported configurations).
Use `lemonade-server pull` or the built-in Model Manager to download models. You can also import custom GGUF/ONNX models from Hugging Face.
[Browse all built-in models →](https://lemonade-server.ai/docs/server/server_models/)
Lemonade supports the following configurations and makes it easy to switch between them at runtime. Find more information here.
| Hardware | Engine: OGA | Engine: llamacpp | Engine: FLM | Windows | Linux |
|---|---|---|---|---|---|
| 🧠 CPU | All platforms | All platforms | — | ✅ | ✅ |
| 🎮 GPU | — | Vulkan: All platforms<br>ROCm: Selected AMD platforms*<br>Metal: Apple Silicon | — | ✅ | ✅ |
| 🤖 NPU | AMD Ryzen™ AI 300 series | — | Ryzen™ AI 300 series | ✅ | — |
\* Supported AMD ROCm platforms:

| Architecture | Platform Support | GPU Models |
|---|---|---|
| gfx1151 (STX Halo) | Windows, Ubuntu | Ryzen AI MAX+ Pro 395 |
| gfx120X (RDNA4) | Windows, Ubuntu | Radeon AI PRO R9700, RX 9070 XT/GRE/9070, RX 9060 XT |
| gfx110X (RDNA3) | Windows, Ubuntu | Radeon PRO W7900/W7800/W7700/V710, RX 7900 XTX/XT/GRE, RX 7800 XT, RX 7700 XT |
| Under Development | Under Consideration | Recently Completed |
|---|---|---|
| Image Generation | vLLM support | General speech-to-text support (whisper.cpp) |
| Add imagegen and transcription to app | Handheld devices: Ryzen AI Z2 Extreme APUs | Multiple models loaded at the same time |
| ROCm support for Ryzen AI 360-375 (Strix) APUs | Text to speech | Lemonade desktop app |
You can use any OpenAI-compatible client library by configuring it to use `http://localhost:8000/api/v1` as the base URL. The table below lists official and popular OpenAI clients in different languages; pick whichever you prefer.
| Python | C++ | Java | C# | Node.js | Go | Ruby | Rust | PHP |
|---|---|---|---|---|---|---|---|---|
| [openai-python](https://github.com/openai/openai-python) | [openai-cpp](https://github.com/olrea/openai-cpp) | [openai-java](https://github.com/openai/openai-java) | [openai-dotnet](https://github.com/openai/openai-dotnet) | [openai-node](https://github.com/openai/openai-node) | [go-openai](https://github.com/sashabaranov/go-openai) | [ruby-openai](https://github.com/alexrudall/ruby-openai) | [async-openai](https://github.com/64bit/async-openai) | [openai-php](https://github.com/openai-php/client) |
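Because the server follows the OpenAI API convention, you can also smoke-test it without any client library. Below is a minimal sketch with `curl`, assuming the default port and the standard `/chat/completions` route under the base URL; the Python example that follows does the same through the official client.

```bash
# Send a chat completion request directly to Lemonade Server
curl http://localhost:8000/api/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Llama-3.2-1B-Instruct-Hybrid",
    "messages": [{"role": "user", "content": "What is the capital of France?"}]
  }'
```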
```python
from openai import OpenAI

# Initialize the client to use Lemonade Server
client = OpenAI(
    base_url="http://localhost:8000/api/v1",
    api_key="lemonade"  # required but unused
)

# Create a chat completion
completion = client.chat.completions.create(
    model="Llama-3.2-1B-Instruct-Hybrid",  # or any other available model
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ]
)

# Print the response
print(completion.choices[0].message.content)
```
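Streaming works through the stock `openai-python` interface as well; this is a sketch of standard OpenAI-client usage rather than anything Lemonade-specific, assuming the same server and model as above:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/api/v1", api_key="lemonade")

# Request a streamed response and print tokens as they arrive
stream = client.chat.completions.create(
    model="Llama-3.2-1B-Instruct-Hybrid",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    stream=True,
)
for chunk in stream:
    # Each chunk carries a delta holding zero or more new characters
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```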
For more detailed integration instructions, see the Integration Guide.
The Lemonade Python SDK is also available, which includes the following components:

- `lemonade` CLI: lets you mix-and-match LLMs (ONNX, GGUF, SafeTensors) with prompting templates, accuracy testing, performance benchmarking, and memory profiling to characterize your models on your hardware.

To read our frequently asked questions, see our FAQ Guide.
We are actively seeking collaborators from across the industry. If you would like to contribute to this project, please check out our contribution guide.
New contributors can find beginner-friendly issues tagged with "Good First Issue" to get started.
This project is sponsored by AMD. It is maintained by @danielholanda, @jeremyfowers, @ramkrishna, and @vgodsoe in equal measure. You can reach us by [filing an issue](https://github.com/lemonade-sdk/lemonade/issues), emailing [email protected], or joining our [Discord](https://discord.gg/5xXzkMu8Zk).