Start an OpenAI-compatible API server

LlamaEdge supports running LLMs alongside embedding models, allowing you to start a drop-in replacement for the OpenAI API.

Step 1: Install WasmEdge

First off, you'll need WasmEdge. To install WasmEdge along with the necessary plugin for AI inference, open your terminal and execute the following command:

curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install_v2.sh | bash -s

This command fetches and runs the WasmEdge installation script, which automatically installs WasmEdge and the WASI-NN plugin essential for running models like Llama 3.2 and Nomic Embed.
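
Once the script completes, you may need to load the environment it sets up before the wasmedge command is available in your current shell ($HOME/.wasmedge/env is the installer's default location; the script prints the exact path). You can then verify the installation:

# Load the WasmEdge environment (default install location)
source $HOME/.wasmedge/env

# Check that WasmEdge is installed
wasmedge --version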

Step 2: Download the LLM Model and Embedding Model

Next, you'll need to obtain the model files. For this tutorial, we're using the Llama 3.2 1B model fine-tuned for instruction following and the Nomic embed model, but the steps are generally applicable to other models too. Use the following commands to download them.

# The chat model is Llama 3.2 1B Instruct
curl -LO https://huggingface.co/second-state/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q5_K_M.gguf

# The embedding model is nomic-embed-text-v1.5
curl -LO https://huggingface.co/second-state/Nomic-embed-text-v1.5-Embedding-GGUF/resolve/main/nomic-embed-text-v1.5-f16.gguf

These commands download the Llama-3.2-1B-Instruct and nomic-embed-text-v1.5 models from Hugging Face, an AI model hosting platform.
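
As a quick sanity check, make sure both GGUF files are now in your working directory:

ls -lh *.gguf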

Step 3: Download a Portable OpenAI Compatible Server

To start an OpenAI-compatible API server, you need the LlamaEdge API server app.

curl -LO https://github.com/second-state/LlamaEdge/releases/latest/download/llama-api-server.wasm

The llama-api-server.wasm app is a web server that exposes an OpenAI-compatible API.

The LlamaEdge apps are written in Rust and compiled to portable Wasm. That means they run across devices and OSes without any change to the binary. You can simply download and run the compiled Wasm app regardless of your platform.
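
If you'd like to see the server's full list of options before starting it, you can ask the app for its usage text (assuming the release supports the conventional --help flag):

wasmedge llama-api-server.wasm --help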

Step 4: Start the API Server

With everything set up, it's time to run the models as follows.

wasmedge --dir .:. \
--nn-preload default:GGML:AUTO:Llama-3.2-1B-Instruct-Q5_K_M.gguf \
--nn-preload embedding:GGML:AUTO:nomic-embed-text-v1.5-f16.gguf \
llama-api-server.wasm -p llama-3-chat,embedding \
--model-name Llama-3.2-1B-Instruct-Q5_K_M,nomic-embed-text-v1.5-f16 \
--ctx-size 8192,8192 \
--batch-size 128,8192 \
--log-prompts --log-stat

This command starts the API server with both models loaded, allowing you to interact with the Llama 3.2 1B model over the API. Here, wasmedge is the command to run the WasmEdge runtime, --nn-preload registers each model with the WASI-NN plugin under a named alias, and -p sets the prompt template for each model (llama-3-chat for the LLM and embedding for the embedding model).
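
Once the server reports that it is listening (port 8080 by default), you can confirm it is up from another terminal. The OpenAI-compatible models endpoint should list both registered model names:

curl http://localhost:8080/v1/models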

Step 5: Send an API Request

Now you have a drop-in replacement for the OpenAI API. You can integrate it with any agent or framework built on OpenAI, using the following configuration.

Config option                      Value
API endpoint URL                   http://localhost:8080/v1
Model Name (for LLM)               Llama-3.2-1B-Instruct-Q5_K_M
Model Name (for Text embedding)    nomic-embed-text-v1.5-f16
API key                            Empty or any value
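
To verify the full setup, you can send standard OpenAI-style requests with curl. The two requests below exercise the chat model and the embedding model respectively; the model fields match the --model-name values from Step 4, and the question is just an arbitrary example prompt.

# Chat completion request served by the LLM
curl -X POST http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "Llama-3.2-1B-Instruct-Q5_K_M", "messages": [{"role": "user", "content": "What is the capital of France?"}]}'

# Embedding request served by the embedding model
curl -X POST http://localhost:8080/v1/embeddings \
  -H 'Content-Type: application/json' \
  -d '{"model": "nomic-embed-text-v1.5-f16", "input": ["LlamaEdge is a portable way to run LLMs."]}'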

Congratulations! Next, you can integrate your API server with OpenAI ecosystem apps.