Locality-sensitive Analysis For Rank-based Similarity Search

Michael Houle
National Institute of Informatics: Tokyo, Japan


The similarity search problem, also known as nearest-neighbour search problem,is defined as follows: given a collection of data objects, build a data structure which, for a given query object, returns a list of data objects most similar to the query. Similarity search is an operation of fundamental importance in many application areas, including data mining, information retrieval, and multimedia databases. The effectiveness of known methods for similarity search is generally limited due a phenomenon known as the "curse of dimensionality", whereby the query or preprocessing time (or both) depends exponentially on the number of features used to represent the objects. When the number of features is large (as is often the case in practical applications), the query cost approaches that of a sequential search through the entire data collection. In recent years, a number of similarity search strategies have been proposed in an attempt to mitigate the effects of the curse of dimensionality, either by trading the exactness of the query result for improvements in query processing time, or by relating query performance to measures of the so-called "intrinsic" dimensionality of the data set, or both In this presentation, we will survey several of these strategies - locality sensitive hashing (LSH), the cover tree, and the SASH hierarchical search structure - and introduce a new similarity search structure, the rank cover tree (RCT), that blends the characteristics of all three approaches.