Please use this identifier to cite or link to this item:
http://arks.princeton.edu/ark:/88435/dsp01nz806254t
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | Russakovsky, Olga | - |
dc.contributor.author | McCaffrey, Ryan | - |
dc.date.accessioned | 2019-09-04T17:47:22Z | - |
dc.date.available | 2019-09-04T17:47:22Z | - |
dc.date.created | 2019-05-06 | - |
dc.date.issued | 2019-09-04 | - |
dc.identifier.uri | http://arks.princeton.edu/ark:/88435/dsp01nz806254t | - |
dc.description.abstract | Over the last two years, computer vision (CV) research has placed increasing focus on the task of video moment localization using natural language queries: given a free-form language description of an action and an untrimmed video clip, the goal is to retrieve the moment in the video that best corresponds to the given action description. A persistent challenge in this task, and in CV more generally, is understanding action descriptions that are unseen or seen only infrequently during training. The long tail of complex action descriptions offers an opportunity for compositional analysis of their underlying structures, but existing architectures lack the logic to perform this analysis. While current video moment localization architectures attempt to leverage language structure, they do so in static ways that prevent the models from adapting to the varying structures of the input language. In this work, I make three major contributions. First, I show that current approaches to the video moment retrieval task fail to correctly understand actions that are unseen or seen infrequently during training. Second, I take an existing dataset for video moment localization, the Distinct Describable Moments (DiDeMo) dataset, and offer a new zero-shot split that highlights the degraded performance of existing models in zero-shot settings. Finally, I build a multimodal attention network that adapts to the dynamic structure of a given language query, and I show performance competitive with existing state-of-the-art models on the DiDeMo and zero-shot DiDeMo datasets. | en_US |
dc.format.mimetype | application/pdf | - |
dc.language.iso | en | en_US |
dc.title | Toward Zero-Shot Action Recognition for Video Moment Localization | en_US |
dc.type | Princeton University Senior Theses | - |
pu.date.classyear | 2019 | en_US |
pu.department | Computer Science | en_US |
pu.pdf.coverpage | SeniorThesisCoverPage | - |
pu.contributor.authorid | 961177843 | - |
Appears in Collections: Computer Science, 1988-2020
Files in This Item:
File | Description | Size | Format | |
---|---|---|---|---|
MCCAFFREY-RYAN-THESIS.pdf | | 1.74 MB | Adobe PDF | Request a copy |
Items in Dataspace are protected by copyright, with all rights reserved, unless otherwise indicated.