1998 ApproximateNearestNeighbors

From GM-RKB
Jump to navigation Jump to search

Subject Headings: Nearest Neighbor Search Task, Curse of Dimensionality.

Notes

Cited By

Quotes

Abstract

The nearest neighbor problem is the following: Given a set of [math]\displaystyle{ n }[/math] points P={p_1,...,p_n} in some metric space [math]\displaystyle{ X }[/math], preprocess P/ so as to efficiently answer queries which require finding the point in [math]\displaystyle{ P }[/math] closest to a query point q ∈ X. We focus on the particularly interesting case of the [math]\displaystyle{ d }[/math]-dimensional Euclidean space where X = R^d under some l_p norm. Despite decades of effort, the current solutions are far from satisfactory; in fact, for large [math]\displaystyle{ d }[/math], in theory or in practice, they provide little improvement over the brute-force algorithm which compares the query point to each data point. Of late, there has been some interest in the approximate nearest neighbors problem, which is: Find a point p ∈ P that is an e-approximate nearest neighbor of the query [math]\displaystyle{ q }[/math] in that for all p’ ∈ P, d(p, q) < (1 + e)d(p’, q).

We present two algorithmic results for the approximate version that significantly improve the known bounds: (a) preprocessing cost polynomial in [math]\displaystyle{ n }[/math] and [math]\displaystyle{ d }[/math], and a truly sublinear query time (for [math]\displaystyle{ e }[/math] > 1); and, (b) query time polynomial in log-n and [math]\displaystyle{ d }[/math], and only a mildly exponential preprocessing cost Õ(n) × O(1/e^d). Further, applying a classical geometric lemma on random projections (for which we give a simpler proof), we obtain the first known algorithm with polynomial preprocessing and query time polynomial in [math]\displaystyle{ d }[/math] and log n. Unfortunately, for small [math]\displaystyle{ e }[/math], the latter is a purely theoretical result since the exponent depends on l/e. Experimental results indicate that our first algorithm offers orders of magnitude improvement on running times over real data sets. Its key ingredient is the notion of locality-sensitive hashing which may be of independent interest; here, we give applications to information retrieval, pattern recognition, dynamic closest-pairs, and fast clustering algorithms.


References

  • 1 Pankaj K. Agarwal, Jirí Matoušek, Ray shooting and parametric search, Proceedings of the twenty-fourth annual ACM symposium on Theory of computing, p.517-526, May 04-06, 1992, Victoria, British Columbia, Canada  doi:10.1145/129712.129763
  • 2 Sunil Arya, David M. Mount, Approximate nearest neighbor queries in fixed dimensions, Proceedings of the fourth annual ACM-SIAM Symposium on Discrete algorithms, p.271-280, January 25-27, 1993, Austin, Texas, United States
  • 3 13, Arya, D,M, Mount, and O. Narayan, Accounting for boundary effects in nearest-neighbor searching. Discrete and Computational Geometry, 16(1996):155-176.
  • 4 Sunil Arya, David M. Mount, Nathan S. Netanyahu, Ruth Silverman, Angela Wu, An optimal algorithm for approximate nearest neighbor searching, Proceedings of the fifth annual ACM-SIAM symposium on Discrete algorithms, p.573-582, January 23-25, 1994, Arlington, Virginia, United States
  • 5 A. Andersson, Static Dictionaries on RAMs: Query Time is Necessary and Sufficient, Proceedings of the 37th Annual Symposium on Foundations of Computer Science, p.441, October 14-16, 1996
  • 6 Jon Louis Bentley, Multidimensional binary search trees used for associative searching, Communications of the ACM, v.18 n.9, p.509-517, Sept. 1975  doi:10.1145/361002.361007
  • 7 Marshall Bern, Approximate closest-point queries in high dimensions, Information Processing Letters, v.45 n.2, p.95-99, Feb. 26, 1993  doi:10.1016/0020-0190(93)90222-U
  • 8 M. W. Berry, S. T. D %A, A. T. Shippy, A Case Study of Latent Semantic Indexing, University of Tennessee, Knoxville, TN, 1995
  • 9 Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, Geoffrey Zweig, Syntactic clustering of the Web, Selected papers from the sixth International Conference on World Wide Web, p.1157-1166, September 1997, Santa Clara, California, United States
  • 10 C. Buddey, A. Singhal, M. Mitra, and Gerard M. Salton. New Retrieval Approaches Using SMART: TREC 4. In: Pro. ceedings of the Fourth Text Retrieval Conference, National Institute of Standards and Technology, 1995.
  • 11 W. A. Burkhard, R. M. Keller, Some approaches to best-match file searching, Communications of the ACM, v.16 n.4, p.230-236, April 1973  doi:10.1145/362003.362025
  • 12 Tolga Bozkaya, Meral Ozsoyoglu, Distance-based indexing for high-dimensional metric spaces, Proceedings of the 1997 ACM SIGMOD Conference, p.357-368, May 11-15, 1997, Tucson, Arizona, United States
  • 13 F. Cazals, Effective nearest neighbors searching on the hyper-cube, with applications to molecular clustering, Proceedings of the fourteenth annual symposium on Computational geometry, p.222-230, June 07-10, 1998, Minneapolis, Minnesota, United States  doi:10.1145/276884.276910
  • 14 Timothy M. Chan, Approximate nearest neighbor queries revisited, Proceedings of the thirteenth annual symposium on Computational geometry, p.352-358, June 04-06, 1997, Nice, France  doi:10.1145/262839.263001
  • 15 Kenneth L. Clarkson. (1988). “A Randomized Algorithm for Closest-Point Queries.” In: SIAM Journal on Computing, 17(4). doi:10.1137/0217052
  • 16 Kenneth L. Clarkson. (1994). “An Algorithm for Approximate Closest-Point Queries.” In: Proceedings of the tenth annual symposium on Computational Geometry. doi:10.1145/177424.177609.
  • 17 Kenneth L. Clarkson. (1997). “Nearest neighbor queries in metric spaces.” In: Proceedings of the twenty-ninth annual ACM symposium on Theory of computing. doi:10.1145/258533.258655
  • 18 Scott Cost, Steven Salzberg, A Weighted Nearest Neighbor Algorithm for Learning with Symbolic Features, Machine Learning, v.10 n.1, p.57-78, Jan. 1993  doi:10.1023/A:1022664626993
  • 19 T.M. Cover and P.E. Hart, Nearest neighbor pattern classification. {EEE Transactions on Information The. ory, 13(1967):21-27.
  • 20 $. Deerwester, S. T. Dumais, T.K. Landaner, G.W. Furhas, and R.A. Harshman. Indexing by latent semantic analysis. Journal of the Society fo~ Information Sci. ences, 41(1990):391-407.
  • 21 L. Devroye and T.J. Wagner, Nearest neighbor methods in discrimination. In: Handbook of Statistics, vol. 2, P.R. Krishnaiah and L.N. Kanal, eds., North-Holland, 1982.
  • 22 D. Dobkin and R. Lipton. Multidimensional search problems. SIAM Journal on Computing, 5(1976):181- 186.
  • 23 Danny Dolev, Yuval Harari, Nathan Linial, Noam Nisan, Michal Parnas, Neighborhood preserving hashing and approximate queries, Proceedings of the fifth annual ACM-SIAM symposium on Discrete algorithms, p.251-259, January 23-25, 1994, Arlington, Virginia, United States
  • 24 D. Dolev, Y. Harari, and M. Parnas. Finding the neighborhood of a query in a dictionary. In: Proceedings of the ~nd Israel Symposium on Theory and Computing Systems, 1993, pp. 33-42.
  • 25 Richard O. Duda, Peter E. Hart, David G. Stork, Pattern Classification (2nd Edition), Wiley-Interscience, 2000
  • 26 Herbert Edelsbrunner. (1987). “Algorithms in Combinatorial Geometry. Springer-Verlag
  • 27 David Eppstein, Fast hiearchical clustering and other applications of dynamic closet pairs, Proceedings of the ninth annual ACM-SIAM symposium on Discrete algorithms, p.619-628, January 25-27, 1998, San Francisco, California, United States
  • 28 Christos Faloutsos, R. Barber, M. Flickner, J. Hafner, W. Niblack, D. Petkovic, W. Equitz, Efficient and effective querying by image content, Journal of Intelligent Information Systems, v.3 n.3-4, p.231-262, July 1994  doi:10.1007/BF00962238
  • 29 W. Feller. An introduction to Probability Theory and its Applications. John Wiley & Sons, NY, 1991.
  • 30 Myron Flickner, Harpreet Sawhney, Wayne Niblack, Jonathan Ashley, Qian Huang, Byron Dom, Monika Gorkani, Jim Hafner, Denis Lee, Dragutin Petkovic, David Steele, Peter Yanker, Query by Image and Video Content: The QBIC System, Computer, v.28 n.9, p.23-32, September 1995  doi:10.1109/2.410146
  • 31 William B. Frakes, Ricardo Baeza-Yates, Information retrieval: data structures and algorithms, Prentice-Hall, Inc., Upper Saddle River, NJ, 1992
  • 32 P. Frankl, H. Maehara, The Johnson-Lindenstrauss Lemma and the sphericity of some graphs, Journal of Combinatorial Theory Series A, v.44 n.3, p.355-362, June 1987
  • 33 Michael L. Fredman, János Komlós, Endre Szemerédi, Storing a Sparse Table with 0(1) Worst Case Access Time, Journal of the ACM (JACM), v.31 n.3, p.538-544, July 1984  doi:10.1145/828.1884
  • 34 Jerome H. Freidman, Jon Louis Bentley, Raphael Ari Finkel, An Algorithm for Finding Best Matches in Logarithmic Expected Time, ACM Transactions on Mathematical Software (TOMS), v.3 n.3, p.209-226, Sept. 1977  doi:10.1145/355744.355745
  • 35 Allen Gersho, Robert M. Gray, Vector quantization and signal compression, Kluwer Academic Publishers, Norwell, MA, 1991
  • 36 A. Gionis, P. Indyk, and Rajeev Motwani. Similarity Search in High Dimensions via Hashing. Manuscript, 1997.
  • 37 D. Greene, M. Parnas, and F. Yao. Multi-index hashing for information retrieval. In: Proceedings of the 35th Annual IEEE Symposium on Foundations of Computer Science, 1994, pp. 722-731.
  • 38 Trevor Hastie and Robert Tibshirani. Discriminant adaptive nearest neighbor classification. In: First InternaSonal Conference on Knowledge Discovery f.4 Data Mining, 1995, pp. 142-149.
  • 39 H. HoteUing. Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 27( 1933):417-441.
  • 40 P. Indyk, Rajeev Motwani, and S. Venkatasubramanian. Geometric Matching Under Noise- Combinatorial Bounds and Algorithms. Manuscript, 1997.
  • 41 W.B. Johnson and J. Lindenstrauss. Extensions of Lipshitz mapping into Hilbert space. Contemporary Mathematics, 26(1984):189-206.
  • 42 W.B. Jotmson and G. Schechtman. Embedding l~~ into l{'. Acta Mathematica, 149(1982):71-85.
  • 43 K. Karhunen. Uber lineare Methoden in dew Wahrscheinlichkeitsrechnung. Ann. Acad. Sci. Fennicae, Set. A137, 1947.
  • 44 V. Koivune and S. Kassam. Nearest neighbor filters for multivariate data. IEEE l'~rkshop on Nonlinear Signal and Image Processing, 1995.
  • 45 Jon Kleinberg, Two algorithms for nearest-neighbor search in high dimensions, Proceedings of the twenty-ninth annual ACM symposium on Theory of computing, p.599-608, May 04-06, 1997, El Paso, Texas, United States  doi:10.1145/258533.258653
  • 46 Eyal Kushilevitz, Rafail Ostrovsky, Yuval Rabani, Efficient search for approximate nearest neighbor in high dimensional spaces, Proceedings of the thirtieth annual ACM symposium on Theory of computing, p.614-623, May 24-26, 1998, Dallas, Texas, United States  doi:10.1145/276698.276877
  • 47 R. M. Karp, O. Waarts, G. Zweig, The bit vector intersection problem, Proceedings of the 36th Annual Symposium on Foundations of Computer Science (FOCS'95), p.621, October 23-25, 1995
  • 48 N. Linial, E. London, and Y. Rabinovich. The geomctry of graphs and some of its algorithmic appllcation~. In: Proceedings of 85th Annual IEEE Symposium on Foundations of Computer Science, 1994, pp. 577-591.
  • 49 M. Loire. Fonctions aleastoires de second ordere. Processus Stochastiques et mouvcment Brownian, Hermann, Paris, 1948.
  • 50 Jirí Matoušek, Reporting points in halfspaces, Computational Geometry: Theory and Applications, v.2 n.3, p.169-186, Nov. 1992  doi:10.1016/0925-7721(92)90006-E
  • 51 S. Meiser, Point location in arrangements of hyperplanes, Information and Computation, v.106 n.2, p.286-303, Oct. 1993  doi:10.1006/inco.1993.1057
  • 52 Nimrod Megiddo, Applying Parallel Computation Algorithms in the Design of Serial Algorithms, Journal of the ACM (JACM), v.30 n.4, p.852-865, Oct. 1983  doi:10.1145/2157.322410
  • 53 M. Minsky and S. Papert. Perceptrona. MIT Prcs~, Cambridge, MA, 1969.
  • 54 Rajeev Motwani, Prabhakar Raghavan, Randomized algorithms, Cambridge University Press, New York, NY, 1995
  • 55 A. Pentland, R.W. Picard, and S. Sdaroff. Photobook: tools for content-based manipulation of image databases. In: Proceedings of the SPiE Confercncc on Storage and Retrieval of Image and Vidco Databases II, 1994.
  • 56 G. Pisier. The volume of convex bodies and Banach space geometry. Cambridge Universit4' Press, 1989.
  • 57 Gerard M. Salton, Michael J. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, Inc., New York, NY, 1986
  • 58 Hanan Samet, The design and analysis of spatial data structures, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, 1990
  • 59 Timos K. Sellis, Nick Roussopoulos, Christos Faloutsos, Multidimensional Access Methods: Trees Have Grown Everywhere, Proceedings of the 23rd International Conference on Very Large Data Bases, p.13-14, August 25-29, 1997
  • 60 A.W.M. Smeulders and R. JaJn, eds. Image Databaaca and Multi-media Search. Proceedings of the First International Workshop, IDB-MIvIS '96, Amsterdam Unlversky Press, Amsterdam, 1996.
  • 61 J.K. Uhlm~. Satisfying General Pro.,dmity/Similarit.y Queries with Metric Trees. Information ProccaMng Letters, 40(1991):175-179.
  • 62 Peter N. Yianilos, Data structures and algorithms for nearest neighbor search in general metric spaces, Proceedings of the fourth annual ACM-SIAM Symposium on Discrete algorithms, p.311-321, January 25-27, 1993, Austin, Texas, United States
  • 63 T. A. Welch, BOUNDS ON INFORMATION RETRIEVAL EFFICIENCY IN STATIC FILE STRUCTURES., Massachusetts Institute of Technology, Cambridge, MA, 1971
  • 64 A C Yao, F F Yao, A general approach to d-dimensional geometric queries, Proceedings of the seventeenth annual ACM symposium on Theory of computing, p.163-168, May 06-08, 1985, Providence, Rhode Island, United States  doi:10.1145/22145.22163,


 AuthorvolumeDate ValuetitletypejournaltitleUrldoinoteyear
1998 ApproximateNearestNeighborsRajeev Motwani
Piotr Indyk
Approximate Nearest Neighbors: Towards removing the curse of dimensionalityProceedings of the Thirtieth Annual ACM Symposium on Theory of Computinghttp://dx.doi.org/10.1145/276698.27687610.1145/276698.2768761998