Implementation of Vision Transformer, a simple way to achieve SOTA in vision classification with only a single transformer encoder, in PyTorch
Build Vision Agents quickly with any model or video provider
Phi-3.5 for Mac: Locally-run Vision and Language Models
Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
Structure-from-Motion and Multi-View Stereo
The repository provides code for running inference with SAM 2
ICLR2024 Spotlight: curation/training code, metadata, distribution, and pre-trained models for MetaCLIP
"Big Model" trains a visual multimodal VLM with 26M parameters
A fast, powerful, and simple hierarchical vision transformer
Reference PyTorch implementation and models for DINOv3
Towards Real-World Vision-Language Understanding
A large language model and vision-language model based on linear attention
This repository contains the official implementation of FastVLM
Provides code for running inference with the Segment Anything Model (SAM); a minimal inference sketch follows this list
NVIDIA Isaac GR00T N1.5 is the world's first open foundation model for generalized humanoid robot reasoning and skills
A neural network that transforms a design mock-up into a static website
GLM-4.6V/4.5V/4.1V-Thinking, towards versatile multimodal reasoning
Please do not feed the models
Sample code and notebooks for Generative AI on Google Cloud
[CVPR 2025 Best Paper Award] VGGT: Visual Geometry Grounded Transformer
PyTorch code and models for the DINOv2 self-supervised learning method
High-resolution models for human tasks
Unified Multimodal Understanding and Generation Models
4M: Massively Multimodal Masked Modeling
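
Two entries above describe code for running inference with the Segment Anything models. As a minimal sketch, this is the point-prompt flow of the original segment-anything Python package; the checkpoint filename and the blank placeholder image are assumptions for illustration, not part of any entry above.

```python
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Build the model from a downloaded checkpoint (path is an assumption;
# checkpoint links are provided in the segment-anything repository).
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

# Any HxWx3 uint8 RGB image works; a blank placeholder is used here.
image = np.zeros((480, 640, 3), dtype=np.uint8)
predictor.set_image(image)  # computes the image embedding once

# Prompt with a single foreground click at the image center.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),
    point_labels=np.array([1]),  # 1 = foreground, 0 = background
    multimask_output=True,       # return several candidate masks
)
print(masks.shape, scores)  # (3, 480, 640) boolean masks with quality scores
```

Because set_image caches the image embedding, repeated predict calls with different prompts on the same image are cheap; SAM 2 follows a similar prompt-driven design extended to video.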