High-Dimensional Similarity Search for Large Datasets

Dong, Wei

Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/88435/dsp01pc289j087

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	Li, Kai	en_US
dc.contributor.author	Dong, Wei	en_US
dc.contributor.other	Computer Science Department	en_US
dc.date.accessioned	2011-11-18T14:39:19Z	-
dc.date.available	2011-11-18T14:39:19Z	-
dc.date.issued	2011	en_US
dc.identifier.uri	http://arks.princeton.edu/ark:/88435/dsp01pc289j087	-
dc.description.abstract	The volume of images and other non-text data is growing exponentially in today's digital universe. A popular way of extracting useful information from such data is to conduct content-based similarity search, e.g., to search for images with content similar to a query image. How to build data structures to support efficient similarity search on a large scale is an issue of increasing importance. The challenge is that feature-rich data are usually represented as high-dimensional feature vectors, and the curse of dimensionality dictates that as dimensionality grows, any search strategy examines an increasingly large portion of the dataset and eventually degenerates to brute-force scan. In this dissertation, we study several key issues to improve the accuracy and efficiency of high-dimensional similarity search. Specifically, we make the following contributions: First, we enhance Multi-Probe LSH, the state-of-the-art high-dimensional similarity search technique, by developing an accurate performance model that predicts search accuracy with an error rate of less than 5% using only a 1/10 sample of the whole dataset. We use this model to realize automatic performance tuning, which would otherwise be tedious, and to develop an adaptive query processing strategy to ensure the high search quality of each individual query. Second, we develop an efficient offline method for simultaneously finding the K nearest neighbors of every point in the dataset. Our method is 2 to 16 times faster than the state-of-the-art technique when evaluated with different datasets. Furthermore, we show how to use the offline computed nearest-neighbor information to double the speed of online similarity search. Third, we develop compact representations of high-dimensional feature vectors optimized for similarity search tasks. We present an asymmetric distance estimation framework to exploit the information in the uncompressed query data. We use that to further compress the indexed dataset, by 10% to 40%, without compromising search quality. Fourth, we develop a scheme to compactly represent sets of feature vectors, an increasingly popular data representation that is more accurate than single vectors, but also more expensive. Our method dramatically reduces the matching cost in both space and time. In an image classification task, our method achieves 25 times speedup over one of the best existing techniques. Finally, we demonstrate some of the proposed techniques by building a large-scale near-duplicate image search engine. Our system serves more than 50 million web images on a single commodity server and responds to queries within a few seconds.	en_US
dc.language.iso	en	en_US
dc.publisher	Princeton, NJ : Princeton University	en_US
dc.relation.isformatof	The Mudd Manuscript Library retains one bound copy of each dissertation. Search for these copies in the <a href=http://catalog.princeton.edu> library's main catalog </a>	en_US
dc.subject	k-nearest neighbor	en_US
dc.subject	locality-sensitive hashing	en_US
dc.subject	near-duplicate	en_US
dc.subject	sketch	en_US
dc.subject.classification	Computer science	en_US
dc.title	High-Dimensional Similarity Search for Large Datasets	en_US
dc.type	Academic dissertations (Ph.D.)	en_US
pu.projectgrantnumber	690-2143	en_US
Appears in Collections:	Computer Science

Files in This Item:

File	Description	Size	Format
Dong_princeton_0181D_10018.pdf		1.56 MB	Adobe PDF	View/Download

Show simple item record

Search

Browse