Notes for “Visual Question Answering: A Survey of Methods and Datasets”
Method | Pros | Cons |
---|---|---|
Joint embedding approaches | Straightforward in principle and constitute the basis of most current approaches to VQA (a minimal sketch combining a joint embedding with attention follows this table). | On their own they are quite basic and are typically improved by adding other techniques, such as attention. |
Attention mechanisms | Address the limitation of global image features by using local image features, and translate readily to the task of VQA for focusing on the image regions relevant to the question. | Closer inspection by question type shows little or no benefit on binary (yes/no) questions. |
Compositional models | Make better use of supervision and facilitate transfer learning. | For NMN, the assembly of modules relies on an aggressive simplification of the question that discards some grammatical cues. For DMN, applying the same model to text and images is debatable, since sequences of words and sequences of image patches are intrinsically different in nature. |
Models using external knowledge bases | Using an external KB shows an advantage in terms of average accuracy. | It is hard to find a proper way of linking questions to the knowledge base, and most existing VQA datasets contain a majority of questions that require little prior knowledge, so performance on these datasets poorly reflects the particular capabilities of these methods. |
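
For concreteness, here is a minimal sketch of the joint embedding approach extended with a question-guided attention mechanism over image regions, written in PyTorch. The module names, feature dimensions, and the concatenation-based fusion are illustrative assumptions, not the exact architecture of any surveyed method.

```python
# Illustrative sketch only: a joint-embedding VQA model with question-guided
# attention over image regions. Shapes, names and the fusion scheme are
# assumptions, not the architecture of a specific surveyed paper.
import torch
import torch.nn as nn


class JointEmbeddingVQA(nn.Module):
    def __init__(self, vocab_size, num_answers, embed_dim=300,
                 hidden_dim=512, region_dim=2048):
        super().__init__()
        # Question encoder: word embeddings followed by an LSTM.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Project CNN region features (e.g. a flattened 14x14 feature map)
        # into the same space as the question representation.
        self.img_proj = nn.Linear(region_dim, hidden_dim)
        # Attention: one score per region, conditioned on the question.
        self.att = nn.Linear(2 * hidden_dim, 1)
        # Classifier over a fixed vocabulary of candidate answers.
        self.classifier = nn.Linear(2 * hidden_dim, num_answers)

    def forward(self, question_tokens, region_feats):
        # question_tokens: (batch, seq_len); region_feats: (batch, regions, region_dim)
        _, (h, _) = self.lstm(self.embed(question_tokens))
        q = h[-1]                                     # (batch, hidden_dim)
        v = torch.tanh(self.img_proj(region_feats))   # (batch, regions, hidden_dim)
        # Attention weights over image regions, guided by the question.
        q_tiled = q.unsqueeze(1).expand_as(v)
        scores = self.att(torch.cat([v, q_tiled], dim=-1)).squeeze(-1)
        alpha = torch.softmax(scores, dim=1)          # (batch, regions)
        v_att = (alpha.unsqueeze(-1) * v).sum(dim=1)  # attended image feature
        # Joint embedding: fuse question and attended image features.
        joint = torch.cat([q, v_att], dim=-1)
        return self.classifier(joint)                 # logits over answers
```

A plain joint embedding model corresponds to dropping the attention step and fusing the question feature with a single global image feature instead.
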
Dataset | Features |
---|---|
DAQUAR | 795 training and 654 test images; two types of question/answer pairs are collected: synthetically generated and written by human annotators. The main disadvantage of DAQUAR is the restriction of answers to a predefined set of 16 colors and 894 object categories. The dataset also presents strong biases, showing that humans tend to focus on a few prominent objects such as tables and chairs. |
COCO-QA | The question/answer pairs are synthetic, automatically converted from image captions, and fall into four types based on the expected answer: object, number, color, and location. A side effect of the automatic conversion of captions is a high repetition rate of the questions: among the 38,948 questions of the test set, 9,072 (23.29%) also appear as training questions (see the statistics sketch after this table). |
FM-IQA | The questions/answers are provided by humans; answering them typically requires both understanding the visual contents of the image and incorporating prior “common sense” information. |
VQA-real | Contains 614,163 questions, each with 10 answers from 10 different annotators. Little more than purely visual information is required to answer most questions; only 5.5% of the questions were estimated to require adult-level knowledge. |
Visual Genome | The largest available dataset for VQA, with 1.7 million question/answer pairs. Its answers are also more diverse, as shown by the top-1000 most frequent answers covering only about 64% of the correct answers. |
Visual7W | A subset of Visual Genome that contains additional annotations; each question is provided with 4 candidate answers, of which only one is correct, and all the objects mentioned in the questions are visually grounded, i.e. associated with image regions. |
Visual Madlibs | The objective is to determine words that complete a statement describing a given image (fill-in-the-blank). The dataset comprises 10,738 images from COCO [45] and 360,001 focused natural language descriptions. |
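
The repetition and answer-coverage figures quoted above (23.29% of COCO-QA test questions reappearing from training; the top-1000 answers covering about 64% of Visual Genome answers) are simple counting statistics. A minimal sketch of how to compute them, assuming each annotation is a dict with hypothetical "question" and "answer" fields:

```python
# Sketch of the two dataset statistics quoted in the table. The annotation
# format (a list of dicts with "question" and "answer" keys) is an assumption
# for illustration; real dataset files differ.
from collections import Counter


def question_overlap(train_qas, test_qas):
    """Fraction of test questions that also appear verbatim in training."""
    train_questions = {qa["question"] for qa in train_qas}
    repeated = sum(qa["question"] in train_questions for qa in test_qas)
    return repeated / len(test_qas)


def topk_answer_coverage(qas, k=1000):
    """Fraction of answers covered by the k most frequent answer strings."""
    counts = Counter(qa["answer"] for qa in qas)
    covered = sum(c for _, c in counts.most_common(k))
    return covered / sum(counts.values())
```
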