📄️ Quick start with the Llava models
Llava-v1.6-Vicuna-7B is the open-source community's answer to OpenAI's multimodal GPT-4V. It is also known as a vision-language model for its ability to handle both images and text in a conversation. This guide shows you how to set up and run Llava-v1.6-Vicuna-7B using the LlamaEdge API server, which provides an OpenAI-compatible API interface.
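
Once the API server is running, any OpenAI-compatible client can send it an image. The snippet below is a minimal sketch, assuming the server listens at http://localhost:8080/v1 and the model is registered under the name llava-v1.6-vicuna-7b; adjust both to match your own setup.

```python
# Minimal sketch: query a local OpenAI-compatible LlamaEdge API server with an image.
# Assumptions: base URL http://localhost:8080/v1 and model name "llava-v1.6-vicuna-7b"
# are placeholders; change them to match your deployment.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # assumed local endpoint
    api_key="not-needed",                 # assumed: no API key check on a local server
)

response = client.chat.completions.create(
    model="llava-v1.6-vicuna-7b",         # assumed model name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this picture?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/cat.jpg"},
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)
```
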
📄️ Quick start with the Qwen 2.5 VL model
Qwen 2.5 VL is the latest vision-language model from the Qwen series, designed to handle a wide range of complex multimodal tasks. It excels at understanding visual content such as text, charts, and layouts, and can act as an intelligent agent capable of interacting with tools and devices.
📄️ Quick start with the Gemma-3 model
Gemma 3 introduces powerful vision-language capabilities across its 4B, 12B, and 27B models through a custom SigLIP vision encoder, enabling rich interpretation of visual input. It processes fixed-size 896x896 images using a "Pan & Scan" algorithm for adaptive cropping and resizing, balancing detail preservation with computational cost.