Skip navigation
Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/88435/dsp01pc289j087
Full metadata record
DC FieldValueLanguage
dc.contributor.advisorLi, Kaien_US
dc.contributor.authorDong, Weien_US
dc.contributor.otherComputer Science Departmenten_US
dc.date.accessioned2011-11-18T14:39:19Z-
dc.date.available2011-11-18T14:39:19Z-
dc.date.issued2011en_US
dc.identifier.urihttp://arks.princeton.edu/ark:/88435/dsp01pc289j087-
dc.description.abstractThe volume of images and other non-text data is growing exponentially in today's digital universe. A popular way of extracting useful information from such data is to conduct content-based similarity search, e.g., to search for images with content similar to a query image. How to build data structures to support efficient similarity search on a large scale is an issue of increasing importance. The challenge is that feature-rich data are usually represented as high-dimensional feature vectors, and the curse of dimensionality dictates that as dimensionality grows, any search strategy examines an increasingly large portion of the dataset and eventually degenerates to brute-force scan. In this dissertation, we study several key issues to improve the accuracy and efficiency of high-dimensional similarity search. Specifically, we make the following contributions: First, we enhance Multi-Probe LSH, the state-of-the-art high-dimensional similarity search technique, by developing an accurate performance model that predicts search accuracy with an error rate of less than 5% using only a 1/10 sample of the whole dataset. We use this model to realize automatic performance tuning, which would otherwise be tedious, and to develop an adaptive query processing strategy to ensure the high search quality of each individual query. Second, we develop an efficient offline method for simultaneously finding the K nearest neighbors of every point in the dataset. Our method is 2 to 16 times faster than the state-of-the-art technique when evaluated with different datasets. Furthermore, we show how to use the offline computed nearest-neighbor information to double the speed of online similarity search. Third, we develop compact representations of high-dimensional feature vectors optimized for similarity search tasks. We present an asymmetric distance estimation framework to exploit the information in the uncompressed query data. We use that to further compress the indexed dataset, by 10% to 40%, without compromising search quality. Fourth, we develop a scheme to compactly represent sets of feature vectors, an increasingly popular data representation that is more accurate than single vectors, but also more expensive. Our method dramatically reduces the matching cost in both space and time. In an image classification task, our method achieves 25 times speedup over one of the best existing techniques. Finally, we demonstrate some of the proposed techniques by building a large-scale near-duplicate image search engine. Our system serves more than 50 million web images on a single commodity server and responds to queries within a few seconds.en_US
dc.language.isoenen_US
dc.publisherPrinceton, NJ : Princeton Universityen_US
dc.relation.isformatofThe Mudd Manuscript Library retains one bound copy of each dissertation. Search for these copies in the <a href=http://catalog.princeton.edu> library's main catalog </a>en_US
dc.subjectk-nearest neighboren_US
dc.subjectlocality-sensitive hashingen_US
dc.subjectnear-duplicateen_US
dc.subjectsketchen_US
dc.subject.classificationComputer scienceen_US
dc.titleHigh-Dimensional Similarity Search for Large Datasetsen_US
dc.typeAcademic dissertations (Ph.D.)en_US
pu.projectgrantnumber690-2143en_US
Appears in Collections:Computer Science

Files in This Item:
File Description SizeFormat 
Dong_princeton_0181D_10018.pdf1.56 MBAdobe PDFView/Download


Items in Dataspace are protected by copyright, with all rights reserved, unless otherwise indicated.