API Reference
Introduction
LlamaEdge is an OpenAI-compatible API server. You can also point other AI agent frameworks at the LlamaEdge API server by replacing their OpenAI API configuration.
The base URL to send all API requests is `http://localhost:8080/v1`.
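Because the server is OpenAI-compatible, the official `openai` Python package can talk to it once its base URL is overridden. A minimal sketch (the placeholder API key is an assumption: the SDK requires a value, and a default LlamaEdge setup is assumed not to check it):

```python
from openai import OpenAI

# Point the standard OpenAI client at the local LlamaEdge server.
# The api_key is a placeholder; the SDK requires a value even if
# the local server does not validate it.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
```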
Endpoints
Chat
The `chat/completions` endpoint returns an LLM response based on the system prompt and user query.
Non-streaming
By default, the API responds with a full answer in the HTTP response.
Request:
```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -H 'accept:application/json' \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"system", "content": "You are a helpful assistant."}, {"role":"user", "content": "What is the capital of Singapore?"}], "model": "model_name"}'
```
Response:
{"id":"chatcmpl-bcfeebe0-3342-42c0-ac92-0615213e1c97","object":"chat.completion","created":1716380086,"model":"Llama-3-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"Singapore."},"finish_reason":"stop"}],"usage":{"prompt_tokens":61,"completion_tokens":4,"total_tokens":65}}%
Streaming
Add "stream":true
in your request to make the API send back partial responses as the LLM generates its answer.
Request:
```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -H 'accept:application/json' \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"system", "content": "You are a helpful assistant."}, {"role":"user", "content": "What is the capital of France?"}], "model": "model_name", "stream":true}'
```
Response:
data: {"id":"chatcmpl-73a1f57d-185e-42c2-b8a6-ba0bae58f3b4","choices":[{"index":0,"delta":{"role":"assistant","content":"I"},"logprobs":null,"finish_reason":null}],"created":1716381054,"model":"Llama-3-8B-Instruct","system_fingerprint":"fp_44709d6fcb","object":"chat.completion.chunk"}
data: {"id":"chatcmpl-73a1f57d-185e-42c2-b8a6-ba0bae58f3b4","choices":[{"index":0,"delta":{"role":"assistant","content":" am"},"logprobs":null,"finish_reason":null}],"created":1716381054,"model":"Llama-3-8B-Instruct","system_fingerprint":"fp_44709d6fcb","object":"chat.completion.chunk"}
data: {"id":"chatcmpl-73a1f57d-185e-42c2-b8a6-ba0bae58f3b4","choices":[{"index":0,"delta":{"role":"assistant","content":" a"},"logprobs":null,"finish_reason":null}],"created":1716381054,"model":"Llama-3-8B-Instruct","system_fingerprint":"fp_44709d6fcb","object":"chat.completion.chunk"}
...
data: {"id":"chatcmpl-73a1f57d-185e-42c2-b8a6-ba0bae58f3b4","choices":[{"index":0,"delta":{"role":"assistant","content":" an"},"logprobs":null,"finish_reason":null}],"created":1716381055,"model":"Llama-3-8B-Instruct","system_fingerprint":"fp_44709d6fcb","object":"chat.completion.chunk"}
data: {"id":"chatcmpl-73a1f57d-185e-42c2-b8a6-ba0bae58f3b4","choices":[{"index":0,"delta":{"role":"assistant","content":" AI"},"logprobs":null,"finish_reason":null}],"created":1716381055,"model":"Llama-3-8B-Instruct","system_fingerprint":"fp_44709d6fcb","object":"chat.completion.chunk"}
data: {"id":"chatcmpl-73a1f57d-185e-42c2-b8a6-ba0bae58f3b4","choices":[{"index":0,"delta":{"role":"assistant","content":"."},"logprobs":null,"finish_reason":null}],"created":1716381055,"model":"Llama-3-8B-Instruct","system_fingerprint":"fp_44709d6fcb","object":"chat.completion.chunk"}
data: [DONE]
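The equivalent streaming call with the Python client: each `data:` line surfaces as a chunk whose `delta` carries a fragment of the answer, and the SDK stops iterating at `data: [DONE]`. A sketch:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="model_name",  # placeholder, as in the curl example
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    stream=True,
)

# Print each partial token as it arrives; delta.content may be None
# on some chunks (e.g. the final one), so guard before printing.
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta is not None:
        print(delta, end="", flush=True)
print()
```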
Request body
Field | Type | Required | Description | Default | Example |
---|---|---|---|---|---|
messages | List | Required | A list of messages for the conversation. 1. System message (depends on the large language model you use): its `content` and `"role":"system"` are required. 2. User message (required): its `content` and `"role":"user"` are required. | N/A | `"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Hello!"}]` |
model | String | Required | The name of the chat model to use. | N/A | Llama-3-8B-262k-Q5_K_M |
top_p | Number | Optional | An alternative to sampling with temperature, where the model considers only the tokens comprising the top `top_p` probability mass. | 1 | Number between 0 and 1. |
temperature | Number | Optional | Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. | 1 | Number between 0 and 2. |
presence_penalty | Number | Optional | Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics. | 0 | Number between -2.0 and 2.0. |
stream | Boolean | Optional | Stream the answer back as partial responses. | false | `"stream": true` |
frequency_penalty | Number | Optional | Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood of repeating the same line verbatim. | 0 | Number between -2.0 and 2.0. |
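For reference, here is how these fields map onto a call with the Python client sketched earlier (the values are illustrative, not tuned recommendations):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="model_name",  # placeholder, as in the curl examples
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    temperature=0.8,        # number between 0 and 2
    top_p=0.9,              # number between 0 and 1
    presence_penalty=0.5,   # number between -2.0 and 2.0
    frequency_penalty=0.5,  # number between -2.0 and 2.0
    stream=False,           # set True for streaming output
)
```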
Response body
Field | Type | Streaming or non-streaming | Description | Default | Example |
---|---|---|---|---|---|
id | string | Both | A unique identifier for the chat completion. | Generated randomly | chatcmpl-73a1f57d-185e-42c2-b8a6-ba0bae58f3b4 |
object | string | Both | The object type. | N/A | `chat.completion.chunk` in the streaming mode; `chat.completion` in the non-streaming mode. |
choices | array | Both | A list of chat completion choices. | N/A | `"choices":[{"index":0,"message":{"role":"assistant","content":"Paris."},"finish_reason":"stop"}]` |
created | integer | Both | The Unix timestamp (in seconds) of when the chat completion was created. | N/A | 1716380086 |
model | string | Both | The model used for the chat completion. | Depends on the model you use. | Llama-3-8B-Instruct-Q5_K_M |
usage | object | Both | Usage statistics for the completion request, including `completion_tokens`, `prompt_tokens`, and `total_tokens`. | N/A | `"usage":{"prompt_tokens":61,"completion_tokens":4,"total_tokens":65}` |
Embedding
The `embeddings` endpoint computes embeddings for user queries or file chunks.
Request:
```bash
curl -X POST http://localhost:8080/v1/embeddings \
  -H 'accept:application/json' \
  -H 'Content-Type: application/json' \
  -d '{"model": "nomic-embed-text-v1.5.f16", "input":["Paris, city and capital of France, ..., for Paris has retained its importance as a centre for education and intellectual pursuits.", "Paris’s site at a crossroads ..., drawing to itself much of the talent and vitality of the provinces."]}'
```
Response:
```json
{
"object": "list",
"data": [
{
"index": 0,
"object": "embedding",
"embedding": [
0.1428378969,
-0.0447309874,
0.007660218049,
...
-0.0128974719,
-0.03543198109,
0.03974733502,
0.00946635101,
-0.01531364303
]
},
{
"index": 1,
"object": "embedding",
"embedding": [
0.0697753951,
-0.0001159032545,
0.02073983476,
...
0.03565846011,
-0.04550019652,
0.02691745944,
0.02498772368,
-0.003226313973
]
}
],
"model": "nomic-embed-text-v1.5.f16",
"usage": {
"prompt_tokens": 491,
"completion_tokens": 0,
"total_tokens": 491
}
}
```
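The same request through the Python client. As a usage example, a cosine similarity between the two returned vectors is computed in pure Python (the truncated strings stand in for the full passages from the curl example):

```python
import math
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

result = client.embeddings.create(
    model="nomic-embed-text-v1.5.f16",
    input=[
        "Paris, city and capital of France ...",  # full passages from the
        "Paris's site at a crossroads ...",       # curl example go here
    ],
)

vec_a = result.data[0].embedding
vec_b = result.data[1].embedding

# Cosine similarity: dot(a, b) / (||a|| * ||b||)
dot = sum(x * y for x, y in zip(vec_a, vec_b))
norm_a = math.sqrt(sum(x * x for x in vec_a))
norm_b = math.sqrt(sum(x * x for x in vec_b))
print(f"cosine similarity: {dot / (norm_a * norm_b):.4f}")
```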
Retrieve
The `retrieve` endpoint retrieves text from the server's vector collection based on the user's query.
Request:
```bash
curl -X POST http://localhost:8080/v1/retrieve \
  -H 'accept:application/json' \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"system", "content": "You are a helpful assistant."}, {"role":"user", "content": "What is the location of Paris?"}], "model":"nomic-embed-text-v1.5.f16"}'
```
Response:
```json
{
"points": [
{
"source": "\"Paris is located in northern central France, in a north-bending arc of the river Seine whose crest includes two islands, the Île Saint-Louis and the larger Île de la Cité, which form the oldest part of the city. The river's mouth on the English Channel is about 233 mi downstream from the city. The city is spread widely on both banks of the river. Overall, the city is relatively flat, and the lowest point is 35 m above sea level. Paris has several prominent hills, the highest of which is Montmartre at 130 m.\\n\"",
"score": 0.74011195
},
{
"source": "\"The Paris region is the most active water transport area in France, with most of the cargo handled by Ports of Paris in facilities located around Paris. The rivers Loire, Rhine, Rhône, Me\\n\"",
"score": 0.63990676
},
{
"source": "\"Paris\\nCountry\\tFrance\\nRegion\\nÎle-de-France\\r\\nDepartment\\nParis\\nIntercommunality\\nMétropole du Grand Paris\\nSubdivisions\\n20 arrondissements\\nGovernment\\n • Mayor (2020–2026)\\tAnne Hidalgo (PS)\\r\\nArea\\n1\\t105.4 km2 (40.7 sq mi)\\n • Urban\\n (2020)\\t2,853.5 km2 (1,101.7 sq mi)\\n • Metro\\n (2020)\\t18,940.7 km2 (7,313.0 sq mi)\\nPopulation\\n (2023)\\n2,102,650\\n • Rank\\t9th in Europe\\n1st in France\\r\\n • Density\\t20,000/km2 (52,000/sq mi)\\n • Urban\\n (2019)\\n10,858,852\\n • Urban density\\t3,800/km2 (9,900/sq mi)\\n • Metro\\n (Jan. 2017)\\n13,024,518\\n • Metro density\\t690/km2 (1,800/sq mi)\\nDemonym(s)\\nParisian(s) (en) Parisien(s) (masc.), Parisienne(s) (fem.) (fr), Parigot(s) (masc.), \\\"Parigote(s)\\\" (fem.) (fr, colloquial)\\nTime zone\\nUTC+01:00 (CET)\\r\\n • Summer (DST)\\nUTC+02:00 (CEST)\\r\\nINSEE/Postal code\\t75056 /75001-75020, 75116\\r\\nElevation\\t28–131 m (92–430 ft)\\n(avg. 78 m or 256 ft)\\nWebsite\\twww.paris.fr\\r\\n1 French Land Register data, which excludes lakes, ponds, glaciers > 1 km2 (0.386 sq mi or 247 acres) and river estuaries.\\n\"",
"score": 0.62259054
},
{
"source": "\" in Paris\\n\"",
"score": 0.6152092
},
{
"source": "\"The Parisii, a sub-tribe of the Celtic Senones, inhabited the Paris area from around the middle of the 3rd century BC. One of the area's major north–south trade routes crossed the Seine on the île de la Cité, which gradually became an important trading centre. The Parisii traded with many river towns (some as far away as the Iberian Peninsula) and minted their own coins.\\n\"",
"score": 0.5720232
}
],
"limit": 5,
"score_threshold": 0.4
}
```
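The `retrieve` endpoint is LlamaEdge-specific, so the OpenAI SDK has no wrapper for it; a plain HTTP POST works. A sketch using the `requests` package, reusing the payload from the curl example above:

```python
import requests

resp = requests.post(
    "http://localhost:8080/v1/retrieve",
    json={
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is the location of Paris?"},
        ],
        "model": "nomic-embed-text-v1.5.f16",
    },
    timeout=60,
)
resp.raise_for_status()

# Each returned point carries a source chunk and its similarity score.
for point in resp.json()["points"]:
    print(f'{point["score"]:.3f}  {point["source"][:80]}')
```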
Get the models
The `models` endpoint lists the chat and embedding models that are available on your local server.
Request:
```bash
curl -X POST http://localhost:8080/v1/models
```
Response:
{"object":"list","data":[{"id":"Llama-3-8B-Instruct","created":1716383261,"object":"model","owned_by":"Not specified"},{"id":"nomic-embed-text-v1.5.f16","created":1716383261,"object":"model","owned_by":"Not specified"}]}%
Status Codes
HTTP response code | Description | Reason | Solutions |
---|---|---|---|
404 | Not found | The endpoint URL is invalid | Please check the endpoint URL |
500 | Internal Server Error | The model is not found. | Please check the model name. |
400 | Bad request | The request body is invalid or malformed. | Please check the request body against the request body table above. |
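In the Python client, these status codes surface as typed exceptions (subclasses of `openai.APIStatusError`), so each case can be handled separately. A minimal sketch:

```python
import openai
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

try:
    client.chat.completions.create(
        model="no-such-model",  # hypothetical name, used to trigger an error
        messages=[{"role": "user", "content": "Hello!"}],
    )
except openai.NotFoundError as err:        # 404: invalid endpoint URL
    print("Check the endpoint URL:", err)
except openai.BadRequestError as err:      # 400: malformed request body
    print("Check the request body:", err)
except openai.InternalServerError as err:  # 500: e.g. unknown model name
    print("Check the model name:", err)
```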