multimodal-large-language-models

Here are 309 public repositories matching this topic...

BradyFU / Awesome-Multimodal-Large-Language-Models

✨✨Latest Advances on Multimodal Large Language Models

multi-modality instruction-following in-context-learning large-language-models chain-of-thought instruction-tuning visual-instruction-tuning large-vision-language-model multimodal-instruction-tuning large-vision-language-models multimodal-large-language-models multimodal-in-context-learning multimodal-chain-of-thought

Updated Aug 8, 2025

X-PLUG / MobileAgent

Mobile-Agent: The Powerful Mobile Device Operation Assistant Family

android agent harmony ios app gui automation mobile copilot multimodal mobile-agents mllm multimodal-large-language-models gpt4v multimodal-agent

Updated Jul 3, 2025
Python

StarVector is a foundation model for SVG generation that transforms vectorization into a code generation task. Using a vision-language modeling architecture, StarVector processes both visual and textual inputs to produce high-quality SVG code with remarkable precision.

svg vlm llm multimodal-large-language-models

Updated Apr 15, 2025
Python

modelscope / ms-agent

MS-Agent: Lightweight Framework for Empowering Agents with Autonomous Exploration

agent data-science code chatbot multi-agents rag gpts llm multimodal-large-language-models qwen assistantapi open-gpts data-science-assistant deep-research agentic-insight

Updated Aug 14, 2025
Python

ictnlp / LLaMA-Omni

LLaMA-Omni is a low-latency and high-quality end-to-end speech interaction model built upon Llama-3.1-8B-Instruct, aiming to achieve speech capabilities at the GPT-4o level.

speech-to-text speech-to-speech large-language-models multimodal-large-language-models speech-language-model speech-interaction

Updated May 19, 2025
Python

VITA-MLLM / VITA

✨✨VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

multimodal-large-language-models large-multimodal-models

Updated Mar 28, 2025
Python

X-PLUG / mPLUG-DocOwl

mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding

multimodal table-understanding document-understanding mllm multimodal-large-language-models chart-understanding

Updated May 30, 2025
Python

cambrian-mllm / cambrian

Cambrian-1 is a family of multimodal LLMs with a vision-centric design.

computer-vision chatbot representation-learning clip dino large-language-models llms instruction-tuning mllm multimodal-large-language-models

Updated Oct 30, 2024
Python

YangLing0818 / RPG-DiffusionMaster

[ICML 2024] Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs (RPG)

text-to-image image-editting large-language-models multimodal-large-language-models

Updated Feb 1, 2025
Jupyter Notebook

sherlockchou86 / VideoPipe

A cross-platform video structuring (video analysis) framework. If you find it helpful, please give it a star :)

Updated Aug 14, 2025
C++

ByteDance-Seed / Seed1.5-VL

Seed1.5-VL is a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning. It achieves state-of-the-art performance on 38 out of 60 public benchmarks.

cookbook large-language-model vision-language-model multimodal-large-language-models

Updated Jun 14, 2025
Jupyter Notebook

Henry-23 / VideoChat

Real-time voice-interactive digital human, supporting both an end-to-end voice solution (GLM-4-Voice - THG) and a cascaded solution (ASR-LLM-TTS-THG). Appearance and voice are customizable with no training required, voice cloning is supported, and first-packet latency is as low as 3s.