MVSNeRF Fast Generalizable Radiance Field Reconstruction from Multi-View Stereo

This is my reading note for MVSNeRF: Fast Generalizable Radiance Field Reconstruction from Multi-View Stereo. It first build a cost volume at the reference view (we refer to the view i = 1 as the reference view) by warping 2D neural features onto multiple sweeping planes (Sec. 3.1). It then leverage a 3D CNN to reconstruct the neural encoding volume, and use an MLP to regress volume rendering properties, expressing a radiance field (Sec. 3.2). It leverage differentiable ray marching to regress images at novel viewpoints using the radiance field modeled by the network; this enables end-to-end training of our entire framework with a rendering loss (Sec. 3.3)

SimVLM Simple Visual Language Model Pretraining with Weak Supervision

This is my reading note for SimVLM: Simple Visual Language Model Pretraining with Weak Supervision. SimVLM reduces the training complexity by exploiting large-scale weak supervision, and is trained end-to-end with a single prefix language modeling objective

InternVideo General Video Foundation Models via Generative and Discriminative Learning

This is my reading note for InternVideo: General Video Foundation Models via Generative and Discriminative Learning. This paper propose to train a multi-modality model for video by utilizes both masked video prediction and contrast loss. However, this paper uses a encoder-decoder for masked video prediction and the other video encoder for contrast loss

Image as a Foreign Language BEiT Pretraining for All Vision and Vision-Language Tasks

This is my reading note for Image as a Foreign Language BEiT Pretraining for All Vision and Vision-Language Tasks. The paper proposes a multi modality model which models image data as foreign language and propose only to use masked language models as the pre-train tasks.

BLIP-2 Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

This is my reading note for BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. The paper propose Q former to align the visual feature to text feature. Both visual feature and text feature are extracted from fixed models. Q former learned query and output the visual embeds to the text space.

DualToken-ViT Position-aware Efficient Vision Transformer with Dual Token Fusion

This is my reading note for DualToken-ViT Position-aware Efficient Vision Transformer with Dual Token Fusion. The paper discuss efficient transformer, which is based on combining convolution with attention: where convolution extracts local information and then fused with global information via attention.

Visual Instruction Tuning

This is my reading note for Visual Instruction Tuning. The paper exposes a method to train a multi-modality model - that woks like chat GPT. This is achieved by building an instruction following dataset that’s paired with images. The model is then trained on this dataset.

CoCa Contrastive Captioners are Image-Text Foundation Models

This is my reading note for CoCa: Contrastive Captioners are Image-Text Foundation Models. The paper proposes a multi modality model, especially it models the problem as image caption as well as text alignment problem. The model contains three component: a vision encoder, a text decoder (which generates text embedding ) and a multi modality decoder , which generate caption given image and text embedding.

FLAVA A Foundational Language And Vision Alignment Model

This is my reading note for FLAVA: A Foundational Language And Vision Alignment Model. This paper proposes a multi modality model. Especially, the model not only work across modality, but also on each modality and joint modality. To do that, it contains loss functions for both within modality but also across modality. It also proposes to use the same architecture for vision encoder, Text encoder as well as multi -modality encoder.

AutoCLIP Auto-tuning Zero-Shot Classifiers for Vision-Language Models

This is my reading note for AutoCLIP: Auto-tuning Zero-Shot Classifiers for Vision-Language Models. This paper proposes a method to use clip for zero shot image classification, to do that, it first generates several prompt to convert class label to text embedding by average. Then the image is processed by visual encoder. The label of image is the one has slowest distance between label embody and image embedding. This paper propose to use soft Max instead of average for label embedding.