Qiang Zhang

Experienced Computer Vision and Machine Learning Engineer


Llemma An Open Language Model For Mathematics

This is my reading note for Llemma: An Open Language Model For Mathematics. This paper proposes continuing the training of Code Llama on a math dataset to improve its performance on math problems.

Read More

Scaling Autoregressive Multi-Modal Models Pretraining and Instruction Tuning

This is my reading note for Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning. This paper proposes a method for text-to-image generation that is NOT based on diffusion. Instead, it uses an auto-regressive model over tokens.

Read More

Multi-head or Single-head? An Empirical Comparison for Transformer Training

This is my reading note for Multi-head or Single-head? An Empirical Comparison for Transformer Training. This paper shows that multi-head attention performs comparably to deeper single-head attention, but the latter is harder to train and needs careful initialization.

Read More

Grounding Visual Illusions in Language Do Vision-Language Models Perceive Illusions Like Humans?

This is my reading note for Grounding Visual Illusions in Language: Do Vision-Language Models Perceive Illusions Like Humans?. This paper shows that larger models, though more powerful, are also more susceptible to visual illusions, much as humans are.

Read More

What Does BERT Look At? An Analysis of BERT's Attention

This is my reading note for What Does BERT Look At? An Analysis of BERT's Attention. This paper studies the attention maps of BERT. It found that the attention maps capture information such as syntax and coreference. It also found that there is a lot of redundancy among the heads of the same layer.

Read More

Localizing and Editing Knowledge in Text-to-Image Generative Models

This is my reading note for Localizing and Editing Knowledge in Text-to-Image Generative Models. This paper studies how each component of a diffusion model contributes to the final result: only the self-attention layers over the last tokens contribute to the final result. It then proposes a simple method to perform image editing by modifying that layer.

Read More

An Image is Worth Multiple Words Learning Object Level Concepts using Multi-Concept Prompt Learning

This is my reading note for An Image is Worth Multiple Words: Learning Object Level Concepts using Multi-Concept Prompt Learning. This paper proposes a method to learn embeddings of multiple concepts for a diffusion model; to this end, it leverages masking on the embeddings and a contrastive loss.

Read More

Language Is Not All You Need Aligning Perception with Language Models

This is my reading note for Language Is Not All You Need: Aligning Perception with Language Models. This paper proposes a multimodal LLM that feeds the visual signal in as a sequence of embeddings, combines them with text embeddings, and trains in a GPT-like way.

Read More

UNITER UNiversal Image-TExt Representation Learning

This is my reading note for UNITER: UNiversal Image-TExt Representation Learning. This paper proposes a vision-language pre-training model. The major innovation is its study of the word-region alignment loss as well as different masked-region-modeling tasks.

Read More

SEED-Bench Benchmarking Multimodal LLMs with Generative Comprehension

This is my reading note for SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension. This paper proposes a benchmark suite for multimodal LLMs. It describes how the data is created and how the tasks are derived. For evaluation, it uses the model's output likelihood of each answer instead of relying directly on generated text answers.

Read More
Page 15 of 34