An Early Evaluation of GPT-4V(ision)

This is my reading note for An Early Evaluation of GPT-4V(ision). The paper highlights its findings as follows:

  1. GPT-4V exhibits impressive performance on English visual-centric benchmarks but fails to recognize simple Chinese texts in the images;
  2. GPT-4V shows inconsistent refusal behavior when answering questions related to sensitive traits such as gender, race, and age;
  3. GPT-4V obtains worse results than GPT-4 (API) on language understanding tasks including general language understanding benchmarks and visual commonsense knowledge evaluation benchmarks;
  4. Few-shot prompting can improve GPT-4V’s performance on both visual understanding and language understanding;
  5. GPT-4V struggles to find the nuances between two similar images and solve the easy math picture puzzles;
  6. GPT-4V shows non-trivial performance on the tasks of similar modalities to image, such as video and thermal. (p. 1)

Vision Transformers Need Registers

This is my reading note for Vision Transformers Need Registers. This paper analyzes the attention maps of vision transformers and finds that, when the model is large enough and trained for long enough, some tokens show exceptionally high norms. Those tokens usually correspond to patches in uniform background regions. The analysis indicates that the model uses them to store global information, which would hurt dense prediction tasks such as image segmentation. To tackle this, the paper proposes adding extra tokens (registers) during training and inference, and discarding them from the output.
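
To make the mechanism concrete, here is a minimal sketch of the idea, not the paper's implementation. The helper names (`make_tokens`, `mix_tokens`, `forward_with_registers`, `token_norms`), the random embeddings, the choice of 4 registers, and appending them after the patch tokens are all assumptions for illustration. `mix_tokens` is a toy stand-in for a transformer block, not real attention.

```python
import math
import random

def make_tokens(count, dim, rng):
    """Random stand-in embeddings; a real model would learn these."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]

def mix_tokens(tokens):
    """Toy stand-in for a transformer block: blend each token with the sequence mean."""
    dim = len(tokens[0])
    mean = [sum(t[i] for t in tokens) / len(tokens) for i in range(dim)]
    return [[0.5 * t[i] + 0.5 * mean[i] for i in range(dim)] for t in tokens]

def forward_with_registers(patch_tokens, register_tokens, depth=2):
    """Append registers, run the toy blocks, then discard the registers."""
    tokens = patch_tokens + register_tokens  # registers take part in the mixing
    for _ in range(depth):
        tokens = mix_tokens(tokens)
    return tokens[:len(patch_tokens)]  # only patch outputs are kept

def token_norms(tokens):
    """L2 norm per token, useful for spotting high-norm outliers."""
    return [math.sqrt(sum(x * x for x in t)) for t in tokens]

rng = random.Random(0)
patches = make_tokens(16, 8, rng)    # e.g. a 4x4 grid of patch embeddings
registers = make_tokens(4, 8, rng)   # 4 extra tokens (an arbitrary choice here)
outputs = forward_with_registers(patches, registers)
norms = token_norms(outputs)
```

The point of the sketch is the shape of the computation: the registers are present while tokens exchange information, giving the model somewhere to keep global information other than background patches, and they are dropped before the output is used. Inspecting `norms` on a real model's patch outputs is one way to look for the high-norm tokens the note describes.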
