Tag: transformer
Transformers model long-range dependencies between input sequence elements and support parallel processing of sequences, in contrast to recurrent networks such as the long short-term memory (LSTM). Unlike convolutional networks, Transformers require minimal inductive biases in their design and are naturally suited as set functions. Furthermore, the straightforward design of Transformers allows multiple modalities (e.g., images, video, text, and speech) to be processed with similar building blocks, and it scales well to very large networks and huge datasets.

- LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models (27 Sep 2023)
- Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models (26 Sep 2023)
- VideoChat: Chat-Centric Video Understanding (25 Sep 2023)
- MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks (24 Sep 2023)
- Scaling Vision Transformers (23 Sep 2023)
- Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone (22 Sep 2023)
- An Empirical Study of Training End-to-End Vision-and-Language Transformers (21 Sep 2023)
- NExT-GPT: Any-to-Any Multimodal LLM (16 Sep 2023)
- Mobile V-MoEs: Scaling Down Vision Transformers via Sparse Mixture-of-Experts (14 Sep 2023)
- SeamlessM4T: Massively Multilingual & Multimodal Machine Translation (05 Sep 2023)
- Multimodal Learning with Transformers: A Survey (02 Sep 2023)
- MovieChat: From Dense Token to Sparse Memory for Long Video Understanding (30 Aug 2023)
- BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing (21 Aug 2023)
- Unified Model for Image, Video, Audio and Language Tasks (16 Aug 2023)
- ProPainter: Improving Propagation and Transformer for Video Inpainting (10 Aug 2023)
- MusicLM: Generating Music From Text (09 Aug 2023)
- SimVLM: Simple Visual Language Model Pretraining with Weak Supervision (07 Aug 2023)
- InternVideo: General Video Foundation Models via Generative and Discriminative Learning (06 Aug 2023)
- Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks (05 Aug 2023)
- DualToken-ViT: Position-aware Efficient Vision Transformer with Dual Token Fusion (03 Aug 2023)
- Visual Instruction Tuning (02 Aug 2023)
- CoCa: Contrastive Captioners are Image-Text Foundation Models (31 Jul 2023)
- FLAVA: A Foundational Language And Vision Alignment Model (30 Jul 2023)
- Pix2seq: A Language Modeling Framework for Object Detection (28 Sep 2022)
- DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation (28 Sep 2022)
- CLIP: Learning Transferable Visual Models From Natural Language Supervision (27 Sep 2022)
- MLP-Mixer: An all-MLP Architecture for Vision (08 May 2021)
- Transformer Introduction (14 Apr 2021)
- Swin Transformer (11 Apr 2021)
- CVPR 2021 Transformer Paper (11 Apr 2021)
- ViT: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (28 Mar 2021)
- End-to-End Object Detection with Transformers (07 Mar 2021)
- Transformer in Computer Vision (03 Feb 2021)
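The long-range dependency modeling and parallelism mentioned in the intro both come from self-attention: every position attends to every other in one step, with no recurrence. A minimal NumPy sketch of scaled dot-product attention (an illustration, not taken from any of the papers above; shapes and names are my own):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention as in 'Attention Is All You Need'.

    Each query attends to all keys at once, so dependencies between
    distant sequence positions are captured in a single parallel step.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # (seq, seq) pairwise scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over the keys
    return weights @ V                                   # weighted sum of values

# Toy example: a sequence of 4 tokens with 8-dimensional embeddings.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
out = scaled_dot_product_attention(x, x, x)              # self-attention: Q = K = V
print(out.shape)                                        # (4, 8): one output per token
```

In a full Transformer block this operation is applied per head on learned projections of the input, followed by a position-wise feed-forward network; recurrent models would instead need O(seq) sequential steps to relate the first and last tokens.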