Skip to main content

API Reference

Introduction

LlamaEdge is an OpenAI compatibale API server. You can also replace the OpenAI API configuration with the LlamaEdge API server in other AI agent frameworks.

The base URL to send all API requests is http://localhost:8080/v1.

Endpoints

Chat

The chat/completions endpoint returns an LLM response based on the system prompt and user query.

Non-streaming

By default, the API responds with a full answer in the HTTP response.

Request

curl -X POST http://localhost:8080/v1/chat/completions \
-H 'accept:application/json' \
-H 'Content-Type: application/json' \
-d '{"messages":[{"role":"system", "content": "You are a helpful assistant."}, {"role":"user", "content": "What is the capital of Singapore?"}], "model": "model_name"}'

Response:

{"id":"chatcmpl-bcfeebe0-3342-42c0-ac92-0615213e1c97","object":"chat.completion","created":1716380086,"model":"Llama-3-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"Singapore."},"finish_reason":"stop"}],"usage":{"prompt_tokens":61,"completion_tokens":4,"total_tokens":65}}%  

streaming

Add "stream":true in your request to make the API send back partial responses as the LLM generates its answer.

Request:

curl -X POST http://localhost:8080/v1/chat/completions \
-H 'accept:application/json' \
-H 'Content-Type: application/json' \
-d '{"messages":[{"role":"system", "content": "You are a helpful assistant."}, {"role":"user", "content": "What is the capital of France?"}], "model": "model_name", "stream":true}'

Response:

data: {"id":"chatcmpl-73a1f57d-185e-42c2-b8a6-ba0bae58f3b4","choices":[{"index":0,"delta":{"role":"assistant","content":"I"},"logprobs":null,"finish_reason":null}],"created":1716381054,"model":"Llama-3-8B-Instruct","system_fingerprint":"fp_44709d6fcb","object":"chat.completion.chunk"}

data: {"id":"chatcmpl-73a1f57d-185e-42c2-b8a6-ba0bae58f3b4","choices":[{"index":0,"delta":{"role":"assistant","content":" am"},"logprobs":null,"finish_reason":null}],"created":1716381054,"model":"Llama-3-8B-Instruct","system_fingerprint":"fp_44709d6fcb","object":"chat.completion.chunk"}

data: {"id":"chatcmpl-73a1f57d-185e-42c2-b8a6-ba0bae58f3b4","choices":[{"index":0,"delta":{"role":"assistant","content":" a"},"logprobs":null,"finish_reason":null}],"created":1716381054,"model":"Llama-3-8B-Instruct","system_fingerprint":"fp_44709d6fcb","object":"chat.completion.chunk"}

...

data: {"id":"chatcmpl-73a1f57d-185e-42c2-b8a6-ba0bae58f3b4","choices":[{"index":0,"delta":{"role":"assistant","content":" an"},"logprobs":null,"finish_reason":null}],"created":1716381055,"model":"Llama-3-8B-Instruct","system_fingerprint":"fp_44709d6fcb","object":"chat.completion.chunk"}

data: {"id":"chatcmpl-73a1f57d-185e-42c2-b8a6-ba0bae58f3b4","choices":[{"index":0,"delta":{"role":"assistant","content":" AI"},"logprobs":null,"finish_reason":null}],"created":1716381055,"model":"Llama-3-8B-Instruct","system_fingerprint":"fp_44709d6fcb","object":"chat.completion.chunk"}

data: {"id":"chatcmpl-73a1f57d-185e-42c2-b8a6-ba0bae58f3b4","choices":[{"index":0,"delta":{"role":"assistant","content":"."},"logprobs":null,"finish_reason":null}],"created":1716381055,"model":"Llama-3-8B-Instruct","system_fingerprint":"fp_44709d6fcb","object":"chat.completion.chunk"}

data: [DONE]

Request body

FieldTypeRequiredDescriptionDefaultExample
messagesListRequiredA list of messages for the conversation.
1 . System message (depends on the large language mode you use)
* content of the system messages is required
* "role":"system" is required
2. User message (required)
* content is required.
* "role":"user" is required
N/A"messages": ["role": "system","content": "You are a helpful assistant."},{"role": "user",
"content": "Hello!"}]
modelStringRequiredThe chat model you usedN/ALlama-3-8B-262k-Q5_K_M
top_pNumberOptionalAn alternative to sampling with temperature. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic.1Number between 0 and 1.
TemperatureNumberOptionalHigher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic.1Number between 0 and 2.
presence_penaltyNumberOptionalPositive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.0Number between -2.0 and 2.0.
streambooleanOptionalMake the answer streaming outputFALSE"stream":true
frequency_penaltyNumberOptionalPositive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood of repeating the same line verbatim.0Number between -2.0 and 2.0.

Response body

FieldTypeStreaming or non-streamingDescriptionDefaultExample
idstringBothA unique identifier for the chat completion.Generated randomlychatcmpl-73a1f57d-185e-42c2-b8a6-ba0bae58f3b4
objectstringBothThe object typechat.completion.chunk in the streaming mode.
chat.completion in the non-streaming mode.
chat.completion.chunk in the streaming mode.
chat.completion in the non-streaming mode.
choicesarrayBothA list of chat completion choices."choices":[{"index":0,"message":{"role":"assistant","content":"Paris."},"finish_reason":"stop"}]
createdintegerBothThe Unix timestamp (in seconds) of when the chat completion was created.N/A1716380086
modelstringBothThe model used for the chat completion.Depends on the model you use.Llama-3-8B-Instruct-Q5_K_M
usageobjectBothUsage statistics for the completion request, including completion_tokens, prompt_tokens, and total_tokens.N/A"usage":{"prompt_tokens":61,"completion_tokens":4,"total_tokens":65}

Embedding

The embeddings endpoint computes embeddings for user queries or file chunks.

Request

curl -X POST http://localhost:8080/v1/embeddings \
-H 'accept:application/json' \
-H 'Content-Type: application/json' \
-d '{"model": "nomic-embed-text-v1.5.f16", "input":["Paris, city and capital of France, ..., for Paris has retained its importance as a centre for education and intellectual pursuits.", "Paris’s site at a crossroads ..., drawing to itself much of the talent and vitality of the provinces."]}'

Response:

{
"object": "list",
"data": [
{
"index": 0,
"object": "embedding",
"embedding": [
0.1428378969,
-0.0447309874,
0.007660218049,
...
-0.0128974719,
-0.03543198109,
0.03974733502,
0.00946635101,
-0.01531364303
]
},
{
"index": 1,
"object": "embedding",
"embedding": [
0.0697753951,
-0.0001159032545,
0.02073983476,
...
0.03565846011,
-0.04550019652,
0.02691745944,
0.02498772368,
-0.003226313973
]
}
],
"model": "nomic-embed-text-v1.5.f16",
"usage": {
"prompt_tokens": 491,
"completion_tokens": 0,
"total_tokens": 491
}
}

Retrieve

The retrieve endpoint can retrieve text from the model's vector collection based on the user's query.

Request:

curl -X POST http://localhost:8080/v1/retrieve \
-H 'accept:application/json' \
-H 'Content-Type: application/json' \
-d '{"messages":[{"role":"system", "content": "You are a helpful assistant."}, {"role":"user", "content": "What is the location of Paris?"}], "model":"nomic-embed-text-v1.5.f16"}'

Response:

{
"points": [
{
"source": "\"Paris is located in northern central France, in a north-bending arc of the river Seine whose crest includes two islands, the Île Saint-Louis and the larger Île de la Cité, which form the oldest part of the city. The river's mouth on the English Channel is about 233 mi downstream from the city. The city is spread widely on both banks of the river. Overall, the city is relatively flat, and the lowest point is 35 m above sea level. Paris has several prominent hills, the highest of which is Montmartre at 130 m.\\n\"",
"score": 0.74011195
},
{
"source": "\"The Paris region is the most active water transport area in France, with most of the cargo handled by Ports of Paris in facilities located around Paris. The rivers Loire, Rhine, Rhône, Me\\n\"",
"score": 0.63990676
},
{
"source": "\"Paris\\nCountry\\tFrance\\nRegion\\nÎle-de-France\\r\\nDepartment\\nParis\\nIntercommunality\\nMétropole du Grand Paris\\nSubdivisions\\n20 arrondissements\\nGovernment\\n • Mayor (2020–2026)\\tAnne Hidalgo (PS)\\r\\nArea\\n1\\t105.4 km2 (40.7 sq mi)\\n • Urban\\n (2020)\\t2,853.5 km2 (1,101.7 sq mi)\\n • Metro\\n (2020)\\t18,940.7 km2 (7,313.0 sq mi)\\nPopulation\\n (2023)\\n2,102,650\\n • Rank\\t9th in Europe\\n1st in France\\r\\n • Density\\t20,000/km2 (52,000/sq mi)\\n • Urban\\n (2019)\\n10,858,852\\n • Urban density\\t3,800/km2 (9,900/sq mi)\\n • Metro\\n (Jan. 2017)\\n13,024,518\\n • Metro density\\t690/km2 (1,800/sq mi)\\nDemonym(s)\\nParisian(s) (en) Parisien(s) (masc.), Parisienne(s) (fem.) (fr), Parigot(s) (masc.), \\\"Parigote(s)\\\" (fem.) (fr, colloquial)\\nTime zone\\nUTC+01:00 (CET)\\r\\n • Summer (DST)\\nUTC+02:00 (CEST)\\r\\nINSEE/Postal code\\t75056 /75001-75020, 75116\\r\\nElevation\\t28–131 m (92–430 ft)\\n(avg. 78 m or 256 ft)\\nWebsite\\twww.paris.fr\\r\\n1 French Land Register data, which excludes lakes, ponds, glaciers > 1 km2 (0.386 sq mi or 247 acres) and river estuaries.\\n\"",
"score": 0.62259054
},
{
"source": "\" in Paris\\n\"",
"score": 0.6152092
},
{
"source": "\"The Parisii, a sub-tribe of the Celtic Senones, inhabited the Paris area from around the middle of the 3rd century BC. One of the area's major north–south trade routes crossed the Seine on the île de la Cité, which gradually became an important trading centre. The Parisii traded with many river towns (some as far away as the Iberian Peninsula) and minted their own coins.\\n\"",
"score": 0.5720232
}
],
"limit": 5,
"score_threshold": 0.4
}

Get the model

The models endpoint provides the chat and embedding models that are available on your local port.

Request:

curl -X POST http://localhost:8080/v1/models

Response:

{"object":"list","data":[{"id":"Llama-3-8B-Instruct","created":1716383261,"object":"model","owned_by":"Not specified"},{"id":"nomic-embed-text-v1.5.f16","created":1716383261,"object":"model","owned_by":"Not specified"}]}%   

Status Codes

HTTP response codeDescriptionReasonSolutions
404Not foundThe endpoint URL is invalidPlease check the endpoint URL
500Internal Server ErrorModel is not found.Please check out the model name.
400Bad request