LlamaEdge logo

The easiest, smallest and fastest local
LLM runtime and API server.

Quick Start with Gaia
Powered by Rust & WasmEdge (A CNCF hosted project)
info
Lightweight
Runtime + API server is less than 30MB. No external dependency. Zero Python packages.
Very fast
Automagically use the device's local hardware and software acceleration.
picture

Cross-platform LLM agents and web services in Rust or JavaScript

Write once run anywhere, for GPUs
Create an LLM web service on a MacBook, deploy it on a NVIDIA device.
Native to the heterogeneous edge
Orchestrate and move an LLM app across CPUs, GPUs and NPUs.
2~4MB
Inference app
30MB
Total dependency
1000+
Llama2 series of models
100%
Native speed

FAQs

Learn more about LlamaEdge

Q: Why can't I just use the OpenAI API?

A: Hosted LLM APIs are easy to use. But they are also expensive and difficult to customize for your own apps. The hosted LLMs are heavily censored (aligned, or “dumbed down”) generalists. It currently costs you millions of dollars and months of time to ask OpenAI to fine-tune ChatGPT for your own knowledge domain.

Furthermore, hosted LLMs are not private. You are at risk of leaking your data and privacy to the LLM hosting companies. In fact, OpenAI requires you to pay more for a “promise” not to use your interaction data in future training.

Q: Why can't I just start an OpenAI-compatible API server over an open-source model, and then use frameworks like LangChain or LlamaIndex in front of the API to build my app?

A: You sure can! In fact, you can start an OpenAI-compatible API server using LlamaEdge. LlamaEdge automagically utilizes the hardware accelerator and software runtime library in your device.

However, often times, we need an compact and integrated solution, instead of a jumbo mixture of LLM runtime, API server, Python middleware, UI, and other glue code to tie them together.

LlamaEdge provides a set of modular components for you to assemble your own LLM agents and applications like Lego blocks. You can do this entirely in Rust or JavaScript, and compile down to a self-contained application binary that runs without modification across many devices.

Q: Why can't I use Python to run the LLM inference?

A: You can certainly use Python to run LLMs and even start an API server using Python. But keep mind that PyTorch has over 5GB of complex dependencies. These dependencies often conflict with Python toolchains such as LangChain. It is often a nightmare to set up Python dependencies across dev and deployment machines, especially with GPUs and containers.

In contrast, the entire LlamaEdge runtime is less than 30MB. It is has no external dependencies. Just install LlamaEdge and copy over your compiled application file!

Q: Why can't I just use native (C/C++ compiled) inference engines?

A: The biggest issue with native compiled apps is that they are not portable. You must rebuild and retest for each computer you deploy the application. It is a very tedious and error prone progress. LlamaEdge programs are written in Rust (soon JS) and compiled to Wasm. The Wasm app runs as fast as native apps, and is entirely portable.

llamaedge_logo
Copyright © 2024 Second State