Tag: multimodal
- Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models (26 Sep 2023)
- VideoChat: Chat-Centric Video Understanding (25 Sep 2023)
- MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks (24 Sep 2023)
- Scaling Vision Transformers (23 Sep 2023)
- Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone (22 Sep 2023)
- An Empirical Study of Training End-to-End Vision-and-Language Transformers (21 Sep 2023)
- NExT-GPT: Any-to-Any Multimodal LLM (16 Sep 2023)
- Mobile V-MoEs: Scaling Down Vision Transformers via Sparse Mixture-of-Experts (14 Sep 2023)
- InstructDiffusion: A Generalist Modeling Interface for Vision Tasks (10 Sep 2023)
- SeamlessM4T: Massively Multilingual & Multimodal Machine Translation (05 Sep 2023)
- SeamlessM4T: Massively Multilingual & Multimodal Machine Translation (04 Sep 2023)
- Multimodal Learning with Transformers: A Survey (02 Sep 2023)
- MovieChat: From Dense Token to Sparse Memory for Long Video Understanding (30 Aug 2023)
- Unified Model for Image, Video, Audio and Language Tasks (16 Aug 2023)
- Link-Context Learning for Multimodal LLMs (13 Aug 2023)
- AVIS: Autonomous Visual Information Seeking with Large Language Models (12 Aug 2023)
- MusicLM: Generating Music From Text (09 Aug 2023)
- SimVLM: Simple Visual Language Model Pretraining with Weak Supervision (07 Aug 2023)
- InternVideo: General Video Foundation Models via Generative and Discriminative Learning (06 Aug 2023)
- Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks (05 Aug 2023)
- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models (04 Aug 2023)
- DualToken-ViT: Position-aware Efficient Vision Transformer with Dual Token Fusion (03 Aug 2023)