Visual Question Answering: A Survey of Methods and Datasets

Notes for “Visual Question Answering: A Survey of Methods and Datasets”

Methods: pros and cons

Joint embedding approaches
Pros: straightforward in principle, and they form the basis of most current approaches to VQA.
Cons: too basic on their own; they can be improved by adding other techniques (a minimal code sketch of this baseline and of the attention variant follows these method notes).

Attention mechanisms
Pros: address the limitation of global image features by using local ones, and translate readily to VQA by focusing on the image regions relevant to the question (see the sketch after these notes).
Cons: closer inspection by question type shows little or no benefit on binary (yes/no) questions.

Compositional models
Pros: make better use of supervision and facilitate transfer learning.
Cons: for NMN, the assembly of modules relies on an aggressive simplification of the questions that discards some grammatical cues; for DMN, applying the same model to text and images is debatable, given the intrinsically different nature of sequences of words and sequences of image patches.

Models using external knowledge bases
Pros: using the external KB shows an advantage in average accuracy.
Cons: it is hard to find a proper way of linking the knowledge base to the rest of the model; moreover, most existing VQA datasets contain a majority of questions that require little prior knowledge, so performance on these datasets poorly reflects the particular capabilities of these methods.
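
The following is a minimal PyTorch sketch of the first two families above, under assumed settings: the class names (JointEmbeddingVQA, AttentionVQA), the feature dimensions, and the elementwise-product / concatenation fusions are illustrative choices, not the exact architectures surveyed. The first model fuses a single global CNN image vector with an LSTM question encoding; the second replaces the global vector with a grid of local region features weighted by question-guided attention.

```python
import torch
import torch.nn as nn


class JointEmbeddingVQA(nn.Module):
    """Joint embedding baseline: fuse one global image vector with the question."""

    def __init__(self, vocab_size=10000, num_answers=1000,
                 img_dim=2048, q_dim=512, joint_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 300)           # word embeddings
        self.lstm = nn.LSTM(300, q_dim, batch_first=True)    # question encoder
        self.img_proj = nn.Linear(img_dim, joint_dim)
        self.q_proj = nn.Linear(q_dim, joint_dim)
        self.classifier = nn.Linear(joint_dim, num_answers)  # answers as classes

    def forward(self, img_feat, question_tokens):
        # img_feat: (B, img_dim) global CNN feature, e.g. a pooled ResNet vector.
        _, (h, _) = self.lstm(self.embed(question_tokens))
        q = h[-1]                                             # (B, q_dim)
        # Elementwise-product fusion of the two projected modalities.
        joint = torch.tanh(self.img_proj(img_feat)) * torch.tanh(self.q_proj(q))
        return self.classifier(joint)                         # answer scores


class AttentionVQA(nn.Module):
    """Attention variant: weight local image regions by relevance to the question."""

    def __init__(self, vocab_size=10000, num_answers=1000,
                 img_dim=2048, q_dim=512, att_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 300)
        self.lstm = nn.LSTM(300, q_dim, batch_first=True)
        self.img_att = nn.Linear(img_dim, att_dim)
        self.q_att = nn.Linear(q_dim, att_dim)
        self.att_score = nn.Linear(att_dim, 1)
        self.classifier = nn.Linear(img_dim + q_dim, num_answers)

    def forward(self, region_feats, question_tokens):
        # region_feats: (B, R, img_dim) local features, e.g. a 14x14 conv grid (R=196).
        _, (h, _) = self.lstm(self.embed(question_tokens))
        q = h[-1]                                                      # (B, q_dim)
        # Unnormalized relevance of each region to the question.
        scores = self.att_score(torch.tanh(
            self.img_att(region_feats) + self.q_att(q).unsqueeze(1)))  # (B, R, 1)
        weights = torch.softmax(scores, dim=1)                         # attention map
        attended = (weights * region_feats).sum(dim=1)                 # (B, img_dim)
        return self.classifier(torch.cat([attended, q], dim=1))
```

Both sketches cast answering as classification over a fixed list of frequent answers, which is the usual setup in these methods; the attention model additionally exposes a per-region weight map showing which parts of the image the question focuses on.
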
Datasets and their features

DAQUAR: 795 training and 654 test images; two types of question/answer pairs are collected, auto-generated (synthetic) and written by human annotators. Its main disadvantage is the restriction of answers to a predefined set of 16 colors and 894 object categories. The dataset also presents strong biases, showing that humans tend to focus on a few prominent objects such as tables and chairs.
COCO-QA: the question/answer pairs are synthetic, generated automatically from image captions, and fall into four types based on the expected answer: object, number, color, and location. A side effect of the automatic conversion of captions is a high repetition rate of the questions: among the 38,948 test questions, 9,072 (23.29%) also appear as training questions.
FM-IQA: the questions and answers are provided by human annotators; answering them typically requires both understanding the visual contents of the image and incorporating prior "common sense" information.
VQA-real: contains 614,163 questions, each with 10 answers from 10 different annotators. Little more than purely visual information is required to answer most questions; only 5.5% of the questions were estimated to require adult-level knowledge (the consensus accuracy used with these 10 answers is sketched after the dataset notes).
Visual Genome: the largest available dataset for VQA, with 1.7 million question/answer pairs; its answers are also more diverse than in other datasets, as shown by the top-1000 most frequent answers covering only about 64% of the correct answers.
Visual7w: a subset of Visual Genome with additional annotations; each question is provided with 4 candidate answers, of which only one is correct, and all objects mentioned in the questions are visually grounded.
Visual Madlibs: the objective is to fill in words that complete a statement describing a given image; the dataset comprises 10,738 images from COCO [45] and 360,001 focused natural language descriptions.
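
A side note on how VQA-real's 10 answers per question are typically scored: the consensus accuracy credits a predicted answer in proportion to the number of annotators who gave it, with full credit once 3 or more agree. Below is a minimal sketch assuming plain lowercased string matching; the official evaluation additionally normalizes answers and averages over annotator subsets, which is omitted here.

```python
def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    """Consensus accuracy: min(#annotators giving this answer / 3, 1.0)."""
    matches = sum(a.strip().lower() == predicted.strip().lower()
                  for a in human_answers)
    return min(matches / 3.0, 1.0)


# Hypothetical example with 10 annotator answers: "red" was given by 4
# annotators (full credit), "blue" by only 2 (partial credit of 2/3).
answers = ["red", "red", "red", "red", "blue", "blue",
           "green", "maroon", "dark red", "crimson"]
print(vqa_accuracy("red", answers))   # 1.0
print(vqa_accuracy("blue", answers))  # 0.666...
```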