Official repository of the paper "Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation"
[CVPR 2024] MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
[AAAI-25] Cobra: Extending Mamba to Multi-modal Large Language Model for Efficient Inference
[ECCV'24] Official Implementation of Autoregressive Visual Entity Recognizer.
[CBMI 2024 Best Paper] Official repository of the paper "Is CLIP the main roadblock for fine-grained open-world perception?"
Hydra is a framework for elegantly configuring complex applications
PyTorch code and models for V-JEPA self-supervised learning from video.
Open-Sora: Democratizing Efficient Video Production for All
✨✨Latest Advances on Multimodal Large Language Models
[ACL 2024] GroundingGPT: Language-Enhanced Multi-modal Grounding Model
Official PyTorch implementation of the paper "TMR: Text-to-Motion Retrieval Using Contrastive 3D Human Motion Synthesis" ICCV 2023
[CVPR2024 Highlight] Official repository of the paper "The devil is in the fine-grained details: Evaluating open-vocabulary object detectors for fine-grained understanding."
Showing how to use CLIP-ViP for video search
Scalable and user-friendly neural 🧠 forecasting algorithms.
WildCapture: code and dataset used in the paper "Leveraging Visual Attention for Out-of-Distribution Detection", published at ICCV 2023, Paris, Out Of Distribution …
The AI-native open-source embedding database
PySlowFast: video understanding codebase from FAIR for reproducing state-of-the-art video models.
[NeurIPS 2023] MotionGPT: Human Motion as a Foreign Language, a unified motion-language generation model using LLMs
DSPy: The framework for programming—not prompting—language models
An Evaluation Framework for Temporal Information Extraction Systems
An Image/Text Retrieval Test Collection to Support Multimedia Content Creation
[ECCV 2022] A PyTorch implementation of TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval [ICCV'21]
Awesome list for research on CLIP (Contrastive Language-Image Pre-Training).
🦦 Otter, a multi-modal model based on OpenFlamingo (open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following and in-context learning ability.
[TPAMI2024] Codes and Models for VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset