An Early Evaluation of GPT-4V(ision)
[multimodal
deep-learning
dataset
gpt-4v
image-caption
vqa
]
This is my reading note for An Early Evaluation of GPT-4V(ision). The paper's highlighted findings are as follows:
- GPT-4V exhibits impressive performance on English visual-centric benchmarks but fails to recognize simple Chinese texts in the images;
- GPT-4V shows inconsistent refusal behavior when answering questions related to sensitive traits such as gender, race, and age;
- GPT-4V obtains worse results than GPT-4 (API) on language understanding tasks including general language understanding benchmarks and visual commonsense knowledge evaluation benchmarks;
- Few-shot prompting can improve GPT-4V’s performance on both visual understanding and language understanding;
- GPT-4V struggles to find the nuances between two similar images and solve the easy math picture puzzles;
- GPT-4V shows non-trivial performance on the tasks of similar modalities to image, such as video and thermal. (p. 1)
The current version of GPT-4V does not support interleaved images and texts and can only accept a maximum of four images. These constraints limit the design space of prompts. (p. 2)
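To make these two constraints concrete, here is a minimal sketch of a prompt builder that enforces them. It does not call any real API: `Prompt`, `build_prompt`, and `MAX_IMAGES` are hypothetical names of my own, and the only facts taken from the paper are the four-image cap and the lack of interleaving. Because images cannot be mixed into the text, the sketch sends all images as one group and leaves the text to refer to them by position.

```python
from __future__ import annotations

from dataclasses import dataclass

# Limit reported in the paper (p. 2); the constant name is my own.
MAX_IMAGES = 4


@dataclass
class Prompt:
    """Hypothetical container: all images first, then a single text block."""
    images: list[str]
    text: str


def build_prompt(image_paths: list[str], text_parts: list[str]) -> Prompt:
    """Group images ahead of the text, since interleaving is not supported."""
    if len(image_paths) > MAX_IMAGES:
        raise ValueError(
            f"got {len(image_paths)} images, but at most {MAX_IMAGES} are accepted"
        )
    # Text cannot be placed between images, so it is joined into one block
    # and must refer to the images by position ("image 1", "image 2", ...).
    return Prompt(images=list(image_paths), text="\n".join(text_parts))


prompt = build_prompt(
    ["example.png", "query.png"],
    ["Image 1 is a worked example.", "Answer the same question for image 2."],
)
```

Under these assumptions, a few-shot prompt that uses one image per example leaves room for at most three image-based demonstrations next to the query image.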
Written on October 3, 2023