Please use this identifier to cite or link to this item:
http://arks.princeton.edu/ark:/88435/dsp01v405sd112
Title: Inferring Intent from Pointing with Computer Vision
Authors: Hinthorn, William
Advisors: Russakovsky, Olga
Department: Computer Science
Certificate Program: Center for Statistics and Machine Learning
Class Year: 2018
Abstract: Over the past five years, Convolutional Neural Networks (CNNs) and massive benchmark datasets have pushed the field of computer vision (CV) to new heights. Current models can segment images according to semantic class with great accuracy. As visual artificial intelligence (AI) becomes integrated into our daily lives, the need arises for models to better understand how humans refer to objects. They must see beyond the explicit class or classes that could plausibly label an entity and understand the intent implicit in the specific localization of a visual query, selecting the label that most likely matches the human intent given the visual context. Insufficient research has been devoted to building CV systems that model the joint attention between humans and machines. In this thesis, I propose an object-part inference task to improve CV’s ability to reason about the nuanced act of human pointing, and in developing this task I make three specific contributions to the goal of building human-like AI. First, I have annotated a dataset of points distributed over 15 of the object classes of the Pascal VOC Parts Challenge dataset. Each point is annotated as most likely referring to the entire object, to a part of the object, or to neither, or as located such that the pointer’s intent cannot be clearly inferred. My second contribution is a statistical analysis of this dataset that examines its biases and offers other insights into the complexities of the object-part inference task. My third contribution is the design of computer vision models that infer human intent given a point on an image. I report 81.5% accuracy on the object-part inference task when conditioned on the semantic object class, and 67.3% accuracy without any additional semantic information. Finally, I extend the task to predict the spatial extent of the object or part indicated by a point and obtain a mIoU of 48.80% over the validation set (a sketch of the standard mIoU computation follows this record). Note that the semantic segmentation mIoU of the simple model used for these scores is a mere 68.28%, well below the state-of-the-art on Pascal VOC; using deeper, more powerful base networks would greatly improve overall accuracy on the object-part task.
URI: http://arks.princeton.edu/ark:/88435/dsp01v405sd112
Type of Material: Princeton University Senior Theses
Language: en
Appears in Collections: Computer Science, 1988-2020
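The abstract reports segmentation quality as mean intersection-over-union (mIoU). For reference, below is a minimal sketch of the standard way mIoU is computed from a pixel-level confusion matrix; the function name and the toy numbers are illustrative assumptions, not code or data from the thesis.

```python
import numpy as np

def mean_iou(conf: np.ndarray) -> float:
    """Mean intersection-over-union from a pixel-level confusion matrix.

    conf[i, j] counts pixels whose ground-truth class is i and whose
    predicted class is j, so for class i:
        IoU_i = conf[i, i] / (row_sum_i + col_sum_i - conf[i, i])
    """
    tp = np.diag(conf).astype(float)
    denom = conf.sum(axis=1) + conf.sum(axis=0) - tp
    valid = denom > 0  # skip classes absent from both prediction and truth
    return float((tp[valid] / denom[valid]).mean())

# Toy two-class example; the numbers are illustrative only.
conf = np.array([[50, 10],
                 [5, 35]])
print(f"mIoU = {mean_iou(conf):.4f}")
```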
Files in This Item:

File | Description | Size | Format | Access
---|---|---|---|---
HINTHORN-WILLIAM-THESIS.pdf | | 3.98 MB | Adobe PDF | Request a copy