Create knowledge embeddings using the API server

The LlamaEdge API server project demonstrates how to support OpenAI-style APIs to upload, chunk, and create embeddings for a text document. In this guide, I will show you how to use those API endpoints as a developer.

This article is intended to demonstrate the capabilities of the open source API server example. You should review the API server source code to learn how those features are implemented. If you are running a RAG application with the API server, check out this guide.

Build the API server

Check out the source code and build it with Rust's cargo tool.

git clone https://github.com/LlamaEdge/LlamaEdge

cd LlamaEdge/api-server
cargo build --target wasm32-wasi --release

The llama-api-server.wasm file is in the target directory.

cp target/wasm32-wasi/release/llama-api-server.wasm . 
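
If the build fails because the wasm32-wasi target is not installed, add it with rustup first and re-run the build.

rustup target add wasm32-wasi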

Download models

We will need an LLM and a specialized embedding model. While the LLM can technically create embeddings, a specialized embedding model does it much better.

# The chat model is Llama2 7b chat
curl -LO https://huggingface.co/second-state/Llama-2-7B-Chat-GGUF/resolve/main/Llama-2-7b-chat-hf-Q5_K_M.gguf

# The embedding model is all-MiniLM-L6-v2
curl -LO https://huggingface.co/second-state/All-MiniLM-L6-v2-Embedding-GGUF/resolve/main/all-MiniLM-L6-v2-ggml-model-f16.gguf

Start the API server

We will now start the API server with both models. The LLM is named default and the embedding model is named embedding. Each is given an external-facing model name in the --model-name argument.

wasmedge --dir .:. \
--nn-preload default:GGML:AUTO:Llama-2-7b-chat-hf-Q5_K_M.gguf \
--nn-preload embedding:GGML:AUTO:all-MiniLM-L6-v2-ggml-model-f16.gguf \
llama-api-server.wasm -p llama-2-chat,embedding --web-ui ./chatbot-ui \
--model-name Llama-2-7b-chat-hf-Q5_K_M,all-MiniLM-L6-v2-ggml-model-f16 \
--ctx-size 4096,384 \
--log-prompts --log-stat
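
To verify that the server is up, you can query the OpenAI-compatible /v1/models endpoint. The response should list both model names registered above. (The examples here assume the server listens on the default port 8080.)

curl http://localhost:8080/v1/models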

Create the embeddings

First, we use the /files API to upload a file paris.txt to the API server.

curl -X POST http://127.0.0.1:8080/v1/files -F "file=@paris.txt"

If the command is successful, you should see output similar to the following in your terminal.

{
  "id": "file_4bc24593-2a57-4646-af16-028855e7802e",
  "bytes": 2161,
  "created_at": 1711611801,
  "filename": "paris.txt",
  "object": "file",
  "purpose": "assistants"
}
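
If you are scripting these steps, you can capture the returned id into a shell variable at upload time (a small convenience, assuming jq is installed).

FILE_ID=$(curl -s -X POST http://127.0.0.1:8080/v1/files -F "file=@paris.txt" | jq -r '.id')
echo $FILE_ID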

Next, take the id and request the /chunks API to break the file paris.txt into smaller pieces. The reason is that each embedding vector can only hold a limited amount of information. The embedding model can "understand" the file content and determine the optimal places to break up the text into chunks.

curl -X POST http://localhost:8080/v1/chunks \
-H 'accept:application/json' \
-H 'Content-Type: application/json' \
-d '{"id":"file_4bc24593-2a57-4646-af16-028855e7802e", "filename":"paris.txt"}'

The following is an example response with the generated chunks.

{
  "id": "file_4bc24593-2a57-4646-af16-028855e7802e",
  "filename": "paris.txt",
  "chunks": [
    "Paris, city and capital of France, ..., for Paris has retained its importance as a centre for education and intellectual pursuits.",
    "Paris’s site at a crossroads ..., drawing to itself much of the talent and vitality of the provinces."
  ]
}

Finally, use the /embeddings API to generate the embedding vectors. Make sure that you pass in the embedding model name.

curl -X POST http://localhost:8080/v1/embeddings \
-H 'accept:application/json' \
-H 'Content-Type: application/json' \
-d '{"model": "all-MiniLM-L6-v2-ggml-model-f16", "input":["Paris, city and capital of France, ..., for Paris has retained its importance as a centre for education and intellectual pursuits.", "Paris’s site at a crossroads ..., drawing to itself much of the talent and vitality of the provinces."]}'

The returned embeddings look like the following.

{
  "object": "list",
  "data": [
    {
      "index": 0,
      "object": "embedding",
      "embedding": [
        0.1428378969,
        -0.0447309874,
        0.007660218049,
        ...
        -0.0128974719,
        -0.03543198109,
        0.03974733502,
        0.00946635101,
        -0.01531364303
      ]
    },
    {
      "index": 1,
      "object": "embedding",
      "embedding": [
        0.0697753951,
        -0.0001159032545,
        0.02073983476,
        ...
        0.03565846011,
        -0.04550019652,
        0.02691745944,
        0.02498772368,
        -0.003226313973
      ]
    }
  ],
  "model": "all-MiniLM-L6-v2-ggml-model-f16",
  "usage": {
    "prompt_tokens": 491,
    "completion_tokens": 0,
    "total_tokens": 491
  }
}
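
As a quick sanity check, you can confirm that every returned vector has the 384 dimensions expected from all-MiniLM-L6-v2. The jq one-liner below assumes you saved the response to a file named embeddings.json.

jq '.data[].embedding | length' embeddings.json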

Next step

Once you have the embeddings in a JSON file, you can store them in a vector database. This will probably require a script that combines each vector with its corresponding source text and upserts the pair into the database's vector collection, as sketched below. This step is specific to the vector database and RAG strategy you choose.
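
As an illustration, here is a minimal sketch of that step against Qdrant's REST API. It assumes a local Qdrant instance on port 6333 and a hypothetical collection named paris; the vector size of 384 matches all-MiniLM-L6-v2, and the ... placeholders stand for the full vectors from the /embeddings response.

# Create a collection sized for 384-dimensional vectors (assumes Qdrant on localhost:6333)
curl -X PUT http://localhost:6333/collections/paris \
  -H 'Content-Type: application/json' \
  -d '{"vectors": {"size": 384, "distance": "Cosine"}}'

# Upsert one point per chunk; the payload keeps each source text next to its vector
curl -X PUT http://localhost:6333/collections/paris/points \
  -H 'Content-Type: application/json' \
  -d '{"points": [
        {"id": 0, "vector": [0.1428378969, -0.0447309874, ...], "payload": {"text": "Paris, city and capital of France, ..."}},
        {"id": 1, "vector": [0.0697753951, -0.0001159032545, ...], "payload": {"text": "Paris’s site at a crossroads ..."}}
      ]}'

Storing the source text in the payload means a later similarity search can return the matching chunks directly, which is the retrieval half of a RAG pipeline.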