Please use this identifier to cite or link to this item:
http://arks.princeton.edu/ark:/88435/dsp01v405sd112
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | Russakovsky, Olga | - |
dc.contributor.author | Hinthorn, William | - |
dc.date.accessioned | 2018-08-14T15:58:43Z | - |
dc.date.available | 2018-08-14T15:58:43Z | - |
dc.date.created | 2018-05-04 | - |
dc.date.issued | 2018-08-14 | - |
dc.identifier.uri | http://arks.princeton.edu/ark:/88435/dsp01v405sd112 | - |
dc.description.abstract | Over the past five years, Convolutional Neural Networks (CNNs) and massive benchmark datasets have pushed the field of computer vision (CV) to new heights. Current models can segment images according to semantic class with great accuracy. As visual artificial intelligence (AI) becomes integrated into our daily lives, the need arises for models to better understand how humans refer to objects. They must see beyond the explicit class or classes that could plausibly be used to label an entity and understand the intent implicit in the specific localization of a visual query, selecting the label that most likely matches the human intent given the visual context. Insufficient research has been devoted to building CV systems that model the joint attention between humans and machines. In this thesis, I propose an object-part inference task to improve CV's ability to reason about the nuanced act of human pointing. In developing this task, I make three specific contributions to the goal of building human-like AI. First, I have annotated a dataset of points distributed over 15 of the object classes of the Pascal VOC Parts Challenge dataset. Each point is annotated as most likely referring to the entire object, to a part of the object, to neither, or as located such that the pointer's intent cannot be clearly inferred. My second contribution is a statistical analysis of this dataset that examines its biases and offers insights into the complexities of the object-part inference task. My third contribution is the design of computer vision models that infer human intent given a point on an image. I report 81.5% accuracy on the object-part inference task when conditioned on the semantic object class, and 67.3% accuracy without any additional semantic information. Finally, I extend the task to predict the spatial extent of the object or part indicated by a point and obtain an mIoU of 48.80% over the validation set. Note that the semantic segmentation mIoU of the simple model used for these scores is a mere 68.28%, well below the state-of-the-art on Pascal VOC. Using deeper, more powerful base networks would greatly improve overall accuracy on the object-part task. | en_US |
dc.format.mimetype | application/pdf | - |
dc.language.iso | en | en_US |
dc.title | Inferring Intent from Pointing with Computer Vision | en_US |
dc.type | Princeton University Senior Theses | - |
pu.date.classyear | 2018 | en_US |
pu.department | Computer Science | en_US |
pu.pdf.coverpage | SeniorThesisCoverPage | - |
pu.contributor.authorid | 960833141 | - |
pu.certificate | Center for Statistics and Machine Learning | en_US |
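The abstract reports segmentation quality as mIoU (mean intersection-over-union). For reference, here is a minimal NumPy sketch of the standard per-class mIoU computation; it illustrates the common definition of the metric, not code from the thesis itself:

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Standard per-class mean intersection-over-union.

    pred and target are integer label maps of identical shape.
    Classes absent from both maps are skipped rather than scored as 0.
    """
    ious = []
    for c in range(num_classes):
        pred_c = pred == c
        target_c = target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            continue  # class appears in neither map; skip it
        intersection = np.logical_and(pred_c, target_c).sum()
        ious.append(intersection / union)
    return float(np.mean(ious))

# Toy usage on 2x2 label maps with 3 classes:
pred = np.array([[0, 1], [1, 2]])
target = np.array([[0, 1], [2, 2]])
print(mean_iou(pred, target, num_classes=3))  # (1 + 0.5 + 0.5) / 3 = 0.666...
```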
Appears in Collections: Computer Science, 1988-2020
Files in This Item:
File | Description | Size | Format
---|---|---|---
HINTHORN-WILLIAM-THESIS.pdf | | 3.98 MB | Adobe PDF
Items in DataSpace are protected by copyright, with all rights reserved, unless otherwise indicated.