A real-time webcam application powered by SmolVLM (Small Vision Language Model) running on Apple Silicon using MLX-VLM. It provides a simple web interface where you can analyze webcam footage in real time with an AI vision language model (mlx-community/SmolVLM-Instruct-4bit).
This repository features a simple demo of real-time object detection using MLX-VLM with mlx-community/SmolVLM-Instruct-4bit, optimized for M1 MacBook Pro.
For improved output quality, you can switch to SmolVLM-Instruct-8bit, though it may need a more recent Apple Silicon chip to run at a comparable speed.
- 🎥 Real-time Webcam Analysis - Capture and analyze webcam frames instantly
- 🧠 SmolVLM Integration - Powered by efficient SmolVLM models via MLX-VLM
- 🌐 Web Interface - Simple, responsive web UI with modern design
- ⚡ Real-time Processing - Fast inference on Apple Silicon devices
- 🎛️ Customizable Settings - Adjust prompts, temperature, tokens, and auto-analysis
- 📱 Mobile Friendly - Responsive design works on various screen sizes
- 🔄 Auto Analysis - Optional automatic frame analysis at set intervals
- Apple Silicon Mac (M1, M2, M3, or newer)
- Python 3.10+
- Webcam or camera access
- Clone or download the application
  # If you have the file locally, navigate to the directory
  cd /path/to/your/mlx-projects
- Install dependencies
  # Install MLX-VLM (the key dependency)
  pip install mlx-vlm
  # Install web server dependencies
  pip install flask flask-socketio
  # Install image processing
  pip install pillow
- Run the application
  python mlx_smolvlm_webcam.py --model mlx-community/SmolVLM-Instruct-4bit --port 8080
- Open your browser
  - Navigate to: http://localhost:8080
  - Click "Start Camera" to enable webcam
  - Click "📸 Analyze Frame" to get AI descriptions
pip install mlx-vlm flask flask-socketio pillow
- mlx-community/SmolVLM-Instruct-4bit (default, recommended)
- mlx-community/SmolVLM-Instruct
- Other SmolVLM models from mlx-community
python mlx_smolvlm_webcam.py --model mlx-community/SmolVLM-Instruct-4bit
python mlx_smolvlm_webcam.py \
--model mlx-community/SmolVLM-Instruct-4bit \
--host 127.0.0.1 \
--port 8080 \
--debug
- --model: HuggingFace model ID (default: mlx-community/SmolVLM-Instruct-4bit)
- --host: Server host (default: 127.0.0.1)
- --port: Server port (default: 8080)
- --debug: Enable debug mode
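The options above map naturally onto an `argparse` parser. The following is only a sketch of how such a parser could look, not the script's actual code: the function name `build_parser` and the help strings are hypothetical, while the flag names and defaults come from the list above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(description="SmolVLM webcam demo (sketch)")
    parser.add_argument(
        "--model",
        default="mlx-community/SmolVLM-Instruct-4bit",
        help="HuggingFace model ID",
    )
    parser.add_argument("--host", default="127.0.0.1", help="Server host")
    parser.add_argument("--port", type=int, default=8080, help="Server port")
    parser.add_argument("--debug", action="store_true", help="Enable debug mode")
    return parser


# Parse an explicit argument list instead of sys.argv to show the behavior.
args = build_parser().parse_args(["--port", "8081"])
print(args.model, args.host, args.port, args.debug)
```

Flags that are not passed fall back to their defaults, so only `--port` changes in the usage line above.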
- Start Camera: Enable webcam access
- 📸 Analyze Frame: Capture and analyze current frame
- ⏸️ Pause / ▶️ Resume: Toggle camera feed
- Custom Prompt: Customize what you want the AI to describe
- Max Tokens: Control response length (5-50)
- Temperature: Adjust creativity/randomness (0.1-1.0)
- Auto Analyze: Automatic analysis every 0.5, 1, 1.5, 2, 2.5, 3, 5, or 10 seconds, or manual only
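If you handle these settings on the server side, it can help to clamp them to the documented ranges before they reach the model. This is a minimal sketch under assumptions: `GenerationSettings`, `sanitize_settings`, and the fallback prompt are hypothetical names and choices, and only the numeric ranges (5-50 tokens, temperature 0.1-1.0) come from the list above.

```python
from dataclasses import dataclass

# Ranges taken from the settings list; the default prompt is an assumption.
MAX_TOKENS_RANGE = (5, 50)
TEMPERATURE_RANGE = (0.1, 1.0)
DEFAULT_PROMPT = "Describe what you see in this image in detail"


def clamp(value, low, high):
    """Return value limited to the closed interval [low, high]."""
    return max(low, min(high, value))


@dataclass(frozen=True)
class GenerationSettings:
    prompt: str
    max_tokens: int
    temperature: float


def sanitize_settings(prompt, max_tokens, temperature) -> GenerationSettings:
    """Coerce raw UI values into the documented ranges."""
    return GenerationSettings(
        prompt=prompt.strip() or DEFAULT_PROMPT,
        max_tokens=clamp(int(max_tokens), *MAX_TOKENS_RANGE),
        temperature=clamp(float(temperature), *TEMPERATURE_RANGE),
    )


settings = sanitize_settings("  ", max_tokens=500, temperature=0.0)
print(settings)
```

Out-of-range values are pulled to the nearest bound rather than rejected, which keeps the UI responsive even if a slider sends an unexpected number.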
- "Describe what you see in this image in detail"
- "What objects are visible in this scene?"
- "Analyze the emotions and expressions of people in this image"
- "Describe the lighting and composition of this scene"
- "What activities are taking place in this image?"
"Module not found: flask_socketio"
pip install flask-socketio
"Model type idefics3 not supported"
- Make sure you're using mlx-vlm, not mlx-lm
pip uninstall mlx-lm
pip install mlx-vlm
"Port already in use"
# Use a different port
python mlx_smolvlm_webcam.py --port 8081
Camera permission denied
- Allow camera access in your browser
- Check System Settings > Privacy & Security > Camera (System Preferences > Security & Privacy > Camera on macOS 12 and earlier)
Model loading fails
# Clear HuggingFace cache and retry (this removes every cached model and any saved login token)
rm -rf ~/.cache/huggingface/
python mlx_smolvlm_webcam.py --model mlx-community/SmolVLM-Instruct-4bit
- Use 4-bit models for faster inference:
  --model mlx-community/SmolVLM-Instruct-4bit
- Image size: the app automatically resizes frames to a 512px maximum dimension
- Lower max tokens for faster responses
- Use auto-analyze sparingly to avoid overwhelming the model
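To make the 512px limit concrete, the sketch below computes the target size for a frame while preserving its aspect ratio. The helper name `fit_within` is hypothetical and the app's own resizing code may differ; with Pillow, calling `Image.thumbnail((512, 512))` on an image gives a comparable result in place and likewise never enlarges it.

```python
def fit_within(width: int, height: int, max_dim: int = 512) -> tuple[int, int]:
    """Scale (width, height) down so the longer side is at most max_dim.

    Frames that already fit are returned unchanged; nothing is upscaled.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    longest = max(width, height)
    if longest <= max_dim:
        return width, height
    scale = max_dim / longest
    # Guard against a zero-sized side on extremely narrow images.
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within(1280, 720))  # a typical 720p webcam frame
print(fit_within(320, 240))   # already small enough, left as-is
```

Smaller frames mean fewer image tokens for the model to process, which is why capping the dimension tends to speed up inference.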
- Backend: Flask + SocketIO for real-time communication
- Frontend: Modern HTML5 + JavaScript with WebSocket
- AI Model: SmolVLM via MLX-VLM for Apple Silicon optimization
- Image Processing: PIL for image handling and resizing
The application captures webcam frames and sends them to SmolVLM for analysis. The AI provides detailed descriptions of what it sees, including objects, people, activities, and scenes.
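This README does not say how frames travel from the browser to the backend. One common approach in webcam apps is for the page to send each frame as a base64 data URL (for example, from a canvas), which the server decodes before handing the bytes to an image library. The sketch below assumes that format; `decode_data_url` is a hypothetical helper, not a function from this project.

```python
import base64
import binascii


def decode_data_url(data_url: str) -> tuple[str, bytes]:
    """Split a 'data:<mime>;base64,<payload>' string into (mime, raw bytes)."""
    header, sep, payload = data_url.partition(",")
    if not sep or not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("expected a base64 data URL")
    mime = header[len("data:"):-len(";base64")]
    try:
        raw = base64.b64decode(payload, validate=True)
    except binascii.Error as exc:
        raise ValueError("payload is not valid base64") from exc
    return mime, raw


# Build a tiny stand-in payload; a real frame would be full JPEG or PNG data.
example = "data:image/jpeg;base64," + base64.b64encode(b"\xff\xd8\xff").decode()
mime, raw = decode_data_url(example)
print(mime, len(raw))
```

The returned bytes could then be opened with `PIL.Image.open(io.BytesIO(raw))` and resized before inference.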
- Gradient backgrounds and modern CSS
- Responsive design for different screen sizes
- Real-time status indicators
- Smooth animations and transitions
- Adjustable AI parameters (temperature, max tokens)
- Custom prompts for specific use cases
- Auto-analysis for continuous monitoring
Scene Description:
"I can see a person sitting at a desk with a laptop computer. There are books and papers scattered on the desk, and a window with natural lighting in the background. The person appears to be working or studying."
Object Detection:
"In this image, I can identify: a laptop computer, several books, a coffee mug, a desk lamp, and a potted plant on the windowsill."
Feel free to submit issues, feature requests, or pull requests to improve this application.
This project is open source. Please check individual dependencies for their respective licenses.
- SmolVLM: HuggingFace's efficient vision language model
- MLX: Apple's machine learning framework for Apple Silicon
- MLX-VLM: MLX integration for vision language models
- Inspired by: https://github.com/ngxson/smolvlm-realtime-webcam
Enjoy analyzing the world through AI! 🎉