Please use this identifier to cite or link to this item:
http://arks.princeton.edu/ark:/88435/dsp01wm117r95s
Title: | Evaluating Compositionality of Vision and Language Models |
Authors: | McCord, Gregory |
Advisors: | Russakovsky, Olga |
Department: | Computer Science |
Certificate Program: | Center for Statistics and Machine Learning |
Class Year: | 2020 |
Abstract: | While research in computer vision has made many breakthroughs over the past two decades, probing even the most state-of-the-art models reveals glaring shortcomings. For example, the same Visual Question Answering (VQA) model that correctly answers the question "What is the largest measurement on the cup?" might incorrectly respond "yes" to the question "Is there a plate under the elephant?" when there is no elephant in the picture at all. Similarly, a top-performing captioning model might incorrectly describe a blue shirt as white or a red flag as black. To address some of these shortcomings, we first propose a vision and language evaluation framework that probes models trained on different tasks to better understand their performance on fine-grained concepts. The concepts we analyze are identifying the colors of objects, counting the number of objects, and identifying the object types in a scene. We evaluate each of these dimensions using distractors (related but incorrect statements) that target one dimension at a time. Further, we train the models with different levels of supervision along the tested dimension to demonstrate that performance on these tests improves only along the supervised dimension, without inhibiting the models' performance on other dimensions. With this scheme, we quantify the types of queries the models answer accurately and those on which they fail. We develop baselines for our framework using the standard Show and Tell captioning model, the Bottom-Up Top-Down captioning model, and the Pythia VQA model. Next, we address two causes of the surprising inconsistencies observed in current VQA models: answering based solely on the language of the question, and placing inappropriate attention on the objects referenced in the question. We generate complex binary questions for VQA models using the scene graphs from the GQA dataset to evaluate how effectively the models align the questions with the corresponding images. Additionally, we test a verification scheme for VQA models that enforces internal consistency by generating a set of entailed questions and correcting answers that fail verification (an illustrative sketch of this loop follows below). On the state-of-the-art Pythia VQA model, our verification improves VQA accuracy from 64.95% to 65.99%. Finally, we discuss broader extensions of this research to financial applications. |
URI: | http://arks.princeton.edu/ark:/88435/dsp01wm117r95s |
Type of Material: | Princeton University Senior Theses |
Language: | en |
Appears in Collections: | Computer Science, 1988-2020 |
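The verification scheme mentioned in the abstract can be pictured as an answer-then-check loop: the model first answers the original question, a handful of yes/no questions entailed by that answer are generated, and the answer is kept only if the model also affirms them. The sketch below is a minimal illustration of that idea under stated assumptions; the `vqa_model` stub, the entailment templates, and the fallback rule are invented here for exposition and are not the thesis's actual implementation or the Pythia API.

```python
# Illustrative consistency-verification loop for a VQA model.
# All names below (vqa_model, entailed_questions, verified_answer) are
# hypothetical stand-ins, not code from the thesis.

def vqa_model(image, question):
    """Hypothetical stand-in for a trained VQA model (e.g., an internally
    inconsistent one), answering from a small canned lookup table."""
    canned = {
        "What color is the shirt?": "blue",
        "Is there a blue shirt in the image?": "no",   # inconsistent on purpose
        "Is there a shirt in the image?": "yes",
    }
    return canned.get(question, "unknown")

def entailed_questions(question, answer):
    """Generate yes/no questions entailed by the model's original answer.
    A single string template is used here purely for illustration."""
    prefix = "What color is the "
    if question.startswith(prefix):
        obj = question[len(prefix):].rstrip("?")
        return [
            (f"Is there a {answer} {obj} in the image?", "yes"),
            (f"Is there a {obj} in the image?", "yes"),
        ]
    return []

def verified_answer(image, question, fallback="unknown"):
    """Keep the model's answer only if it also affirms the entailed questions;
    otherwise fall back (here, abstain with a placeholder answer)."""
    answer = vqa_model(image, question)
    for q, expected in entailed_questions(question, answer):
        if vqa_model(image, q) != expected:
            return fallback  # answer failed verification
    return answer

if __name__ == "__main__":
    # The inconsistent stub answers "blue" but denies seeing a blue shirt,
    # so verification rejects the answer and returns "unknown".
    print(verified_answer("img.jpg", "What color is the shirt?"))
```

In the thesis the entailed questions are derived from the question and answer semantics rather than from string templates as above; the stub merely demonstrates the answer-generate-verify control flow.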
Files in This Item:
File | Description | Size | Format
---|---|---|---
MCCORD-GREGORY-THESIS.pdf | | 1.52 MB | Adobe PDF