Notes for “Visual Question Answering: A Survey of Methods and Datasets”
Method | Pros | Cons |
---|---|---|
Joint embedding approaches | Straightforward in principle and constitute the basis of most current approaches to VQA (a minimal sketch combining a joint embedding with attention follows this table). | On their own they are quite basic and are typically improved by adding other techniques, such as attention. |
Attention mechanisms | Address the limitation of global image features by using local image features, and translate readily to the task of VQA for focusing on the image regions relevant to the question. | Closer inspection by question type shows little or no benefit on binary (yes/no) questions. |
Compositional models | Make better use of supervision and facilitate transfer learning. | For NMN, the assembly of modules relies on an aggressive simplification of the question that discards some grammatical cues. For DMN, applying the same model to text and images is debatable, since sequences of words and sequences of image patches are intrinsically different in nature. |
Models using external knowledge bases | Using an external KB shows an advantage in terms of average accuracy. | It is hard to find a proper way of linking questions to the knowledge base, and most existing VQA datasets contain a majority of questions that require little prior knowledge, so performance on these datasets poorly reflects the particular capabilities of these methods. |
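
For concreteness, here is a minimal sketch of the joint embedding approach extended with a question-guided attention mechanism over image regions, written in PyTorch. The module names, feature dimensions, and the concatenation-based fusion are illustrative assumptions, not the exact architecture of any surveyed method.

```python
# Illustrative sketch only: a joint-embedding VQA model with question-guided
# attention over image regions. Shapes, names and the fusion scheme are
# assumptions, not the architecture of a specific surveyed paper.
import torch
import torch.nn as nn


class JointEmbeddingVQA(nn.Module):
    def __init__(self, vocab_size, num_answers, embed_dim=300,
                 hidden_dim=512, region_dim=2048):
        super().__init__()
        # Question encoder: word embeddings followed by an LSTM.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Project CNN region features (e.g. a flattened 14x14 feature map)
        # into the same space as the question representation.
        self.img_proj = nn.Linear(region_dim, hidden_dim)
        # Attention: one score per region, conditioned on the question.
        self.att = nn.Linear(2 * hidden_dim, 1)
        # Classifier over a fixed vocabulary of candidate answers.
        self.classifier = nn.Linear(2 * hidden_dim, num_answers)

    def forward(self, question_tokens, region_feats):
        # question_tokens: (batch, seq_len); region_feats: (batch, regions, region_dim)
        _, (h, _) = self.lstm(self.embed(question_tokens))
        q = h[-1]                                     # (batch, hidden_dim)
        v = torch.tanh(self.img_proj(region_feats))   # (batch, regions, hidden_dim)
        # Attention weights over image regions, guided by the question.
        q_tiled = q.unsqueeze(1).expand_as(v)
        scores = self.att(torch.cat([v, q_tiled], dim=-1)).squeeze(-1)
        alpha = torch.softmax(scores, dim=1)          # (batch, regions)
        v_att = (alpha.unsqueeze(-1) * v).sum(dim=1)  # attended image feature
        # Joint embedding: fuse question and attended image features.
        joint = torch.cat([q, v_att], dim=-1)
        return self.classifier(joint)                 # logits over answers
```

A plain joint embedding model corresponds to dropping the attention step and fusing the question feature with a single global image feature instead.
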
Dataset | Features |
---|---|
DAQUAR | 795 training and 654 test images; two types of question/answer pairs are collected: synthetically generated and written by human annotators. The main disadvantage of DAQUAR is the restriction of answers to a predefined set of 16 colors and 894 object categories. The dataset also presents strong biases, showing that humans tend to focus on a few prominent objects such as tables and chairs. |
COCO-QA | The question/answer pairs are synthetic, automatically converted from image captions, and fall into four types based on the expected answer: object, number, color, and location. A side effect of the automatic conversion of captions is a high repetition rate of the questions: among the 38,948 questions of the test set, 9,072 (23.29%) also appear as training questions (see the statistics sketch after this table). |
FM-IQA | The questions/answers are provided by humans; answering them typically requires both understanding the visual contents of the image and incorporating prior “common sense” information. |
VQA-real | Contains 614,163 questions, each with 10 answers from 10 different annotators. Little more than purely visual information is required to answer most questions; only 5.5% of the questions were estimated to require adult-level knowledge. |
Visual Genome | The largest available dataset for VQA, with 1.7 million question/answer pairs. Its answers are also more diverse, as shown by the top-1000 most frequent answers covering only about 64% of the correct answers. |
Visual7W | A subset of Visual Genome that contains additional annotations; each question is provided with 4 candidate answers, of which only one is correct, and all the objects mentioned in the questions are visually grounded, i.e. associated with image regions. |
Visual Madlibs | The objective is to determine words that complete a statement describing a given image (fill-in-the-blank). The dataset comprises 10,738 images from COCO [45] and 360,001 focused natural language descriptions. |
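
The repetition and answer-coverage figures quoted above (23.29% of COCO-QA test questions reappearing from training; the top-1000 answers covering about 64% of Visual Genome answers) are simple counting statistics. A minimal sketch of how to compute them, assuming each annotation is a dict with hypothetical "question" and "answer" fields:

```python
# Sketch of the two dataset statistics quoted in the table. The annotation
# format (a list of dicts with "question" and "answer" keys) is an assumption
# for illustration; real dataset files differ.
from collections import Counter


def question_overlap(train_qas, test_qas):
    """Fraction of test questions that also appear verbatim in training."""
    train_questions = {qa["question"] for qa in train_qas}
    repeated = sum(qa["question"] in train_questions for qa in test_qas)
    return repeated / len(test_qas)


def topk_answer_coverage(qas, k=1000):
    """Fraction of answers covered by the k most frequent answer strings."""
    counts = Counter(qa["answer"] for qa in qas)
    covered = sum(c for _, c in counts.most_common(k))
    return covered / sum(counts.values())
```
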