📄️ Quick start with the Llava models
Llava-v1.6-Vicuna-7B is the open-source community's answer to OpenAI's multimodal GPT-4V. It is also known as a vision-language model for its ability to handle both images and text in a conversation. This guide shows you how to set up and run Llava-v1.6-Vicuna-7B using the LlamaEdge API server, which provides an OpenAI-compatible API interface.
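
Once the API server is running, any OpenAI-compatible client can send it an image. The snippet below is a minimal sketch, assuming the server listens at http://localhost:8080/v1 and the model is registered under the name llava-v1.6-vicuna-7b; adjust both to match your own setup.

```python
# Minimal sketch: query a local OpenAI-compatible LlamaEdge API server with an image.
# Assumptions: base URL http://localhost:8080/v1 and model name "llava-v1.6-vicuna-7b"
# are placeholders; change them to match your deployment.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # assumed local endpoint
    api_key="not-needed",                 # assumed: no API key check on a local server
)

response = client.chat.completions.create(
    model="llava-v1.6-vicuna-7b",         # assumed model name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this picture?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/cat.jpg"},
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)
```
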
📄️ Quick start with the Qwen 2.5 VL model
Qwen 2.5 VL is the latest vision-language model from the Qwen series, designed to handle a wide range of complex multimodal tasks. It excels at understanding visual content such as text, charts, and layouts, and can act as an intelligent agent capable of interacting with tools and devices.
📄️ Quick start with the Gemma-3 model
Gemma 3 introduces powerful vision-language capabilities across its 4B, 12B, and 27B models through a custom SigLIP vision encoder, enabling rich interpretation of visual input. It processes fixed-size 896x896 images using a "Pan & Scan" algorithm for adaptive cropping and resizing, balancing detail preservation with computational cost.