Xiong Wang
Department of Computer Science
California State University, Fullerton
wang@ecs.fullerton.edu
Dennis Shasha
Department of Computer Science
Courant Institute of Mathematical Sciences
New York University
shasha@cs.nyu.edu
Kaizhong Zhang
Department of Computer Science
University of Western Ontario, Canada
kzhang@csd.uwo.ca
There are four tools in this package, which is part of the ASES project. Click here for the details of the project.
1. The MetricMap system:
There are 2 header files:
There are 14 source files:
angle1.c, | ind1.c, | lud1.c, | ortho1.c, | qli1.c, |
tri1.c, | uti1.c, | eigw.c, | fbilinw.c, | fchow.c, |
fheadw.c, | prow.c, | repeatw.c, | project.c |
Executable files: femb and proj
Sample input data files: d400.20 and coxrhi.dist
"d400.20" contains the pairwise distances of 400 randomly
generated 20 dimensional vectors, which form 4 clusters.
"coxrhi.dist" contains the pairwise distances of 200 RNA
secondary structures, where 100 RNA secondary structures are pertaining to the
human rhinovirus and the other 100 RNA secondary structures are pertaining to
the coxsackievirus. Thus, these RNA secondary structures roughly form 2
clusters, one corresponding to the human rhinovirus and the other corresponding
to the coxsackievirus. These sample data files may be used to replace the
parameter Distance_in_file in the commands listed below.
To compile:
To run:
"femb" establishes a Dimen dimensional pseudo-Euclidean space based on the sampling data set.
It generates 3 files: b.ind, b.val, and gram.mat.
b.ind contains the indices of the objects that are chosen to be the base of the space.
b.val contains the eigenvalues of the bilinear form that correspond to the chosen objects.
gram.mat is the gram matrix that is used to project objects to the pseudo-Euclidean space.
"proj" projects the data objects to the pseudo-Euclidean space.
Distance_in_file contains the pairwise distances of data objects,
Sample_Size is the size of the sample set,
Dimen is the dimensionality of the target space,
Nobj is the total number of data objects,
Vect_out_file contains the image vectors of the objects
in the pseudo-Euclidean space, and
Dist-out-file contains the pairwise distances of the image vectors.
We give an annotated example of a session of the MetricMap system.
We assume there is a file d400.20 which contains the pairwise distances of
4 clusters of randomly generated vectors in 20 dimensional Euclidean space.
There are 100 vectors in each cluster.
The file looks like the following:
3.5331 3.3376 2.8868 3.3015 3.3660 3.5378 4.0033 3.8468 3.6920 4.3646 3.2924 3.2653 3.8241 3.5259 4.2307 3.9440 3.3932 4.3023 3.9482 4.0962 3.5242 3.8446 3.5387 4.4129 3.7491 3.3720 3.1809 3.1663 3.7694 4.0969 3.2754 3.7708 4.4286 3.3644 4.7370 4.0737 3.6367 3.6421 3.6069 2.8791 4.0440 4.4357 4.0226 3.7571 4.2270 ....... .......
We index the objects, starting from 0.
Strictly speaking, the pairwise distances of N objects is an NxN matrix.
Since the matrix is symmetric, i.e. d(i,j)=
d(j,i), and d(i,i)= 0 for all i, we
store only that part which is below the diagonal line of the matrix, i.e. j < i.
In the above table, an entry at the ith row and jth column represents the distance between object i and object j-1.
So, "3.5331" represents the distance between object 1 and object 0.
"3.3660" represents the distance between object 3 and object 1.
We list objects based on clusters.
Thus cluster 1 contains objects
with indexes ranging from 0 to 99;
cluster 2 contains objects with indexes ranging from 100 to 199, etc.
To map these vectors to a 10 dimensional pseudo-Euclidean space, assume a sampling set of 20 vectors (here we assume a random sampling).
We first run:
The file b.ind looks like the following
9 25 36 28 10 37 14 31 39 32
This means that object 9, object 25, ..., object 32 are chosen as the reference objects.
These reference objects are used to build the target space.
The file b.val looks like the following:
244.295985815851878 30.840193003299831 27.331134046091876 22.693324721447667 21.848601030670395 19.968715757559004 17.760763670010231 15.492304704396824 13.657936946677481 11.905276204308640
Here b.val contains the eigenvalues corresponding to the reference objects.
For example, "244.295985815851878" is the eigenvalue corresponding to object 9; "30.840193003299831" is the eigenvalue corresponding to
object 25, and so on.
The file gram.mat looks like the following:
0.115574453235548 -0.051501465876904 -0.045854384778602 -0.073070673365977 -0.009605172001804 -0.112828982820657 0.088720024289652 -0.099648201210417 -0.083911615757602 -0.119962106364417 0.130328716802373 -0.342840761030932 0.303066694308480 0.339452617555608 -0.381437909747193 -0.405149430555989 0.207206955370795 0.007917643375473 0.073360446342078 -0.272648797992660 0.161125014894303 -0.081818856765714 -0.090949331640256 -0.382668529714077 0.362174216869933 0.011536665348883 0.352833178734573 0.037090889714356 -0.143451180700862 -0.173681276125141 0.517326453798527 0.143421668173576 -0.333934112168199 -0.695769740733675 0.710188476084372 0.183824076960314 0.447244519144627 -0.086988258971959 -0.528231413528893 0.206261348445111 -0.216989958513510 -0.056761272336272 0.054511518044404 0.357906030906817 0.039417076254325 0.047611730984748 -0.042798387337480 -0.161467640610151 -0.063010615954813 0.042618497834656 -0.122629060402885 -0.170704841438197 0.272538934066024 0.106233770116270 0.066302544980609 0.054218174179443 -0.265046095403686 0.136857710174841 -0.077837156218892 -0.089320609615473 -0.160263159664019 -0.045435732313140 -0.204640412776988 0.097376699480695 0.263829774762633 -0.015872668103288 -0.429466876738876 0.389463467793250 -0.056196631019379 -0.040543029498501 0.349560943260916 0.193815613793358 -0.146025738215632 0.165859971660814 -0.168572992954508 0.044216920551938 -0.187463800068824 -0.193002937790857 0.253110146859819 -0.431229731847577 0.225181054224089 0.075061105215083 -0.171809659314470 -0.277591329025099 0.227801981190053 -0.104634551945476 0.055533353905625 -0.228939392394629 0.219356238431454 0.057341962635193 0.068159572817717 -0.235779559410695 -0.290390424430416 -0.083056342690215 0.058481591986641 0.355855377089496 0.003283594117529 0.008180944815393 0.008041381644906 0.176734980138826
Here, gram.mat is the Gram matrix used in the transformation when embedding the objects in the pseudo-Euclidean space.
We then run:
Here we want to project 400 objects, whose pairwise distances are stored in file d400.20, to a 10 dimensional pseudo-Euclidean space. The output includes the vectors (d400.vect), which are the images of those objects in the pseudo-Euclidean space, and the pairwise distances (d400.dist) of those vectors.
The output in the file d400.vect looks like the following:
400 10 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 -2.5781 -2.1680 0.5737 2.0081 -0.7672 -0.0987 0.2544 -0.3053 0.3483 0.0392 -2.3080 0.8428 0.4520 0.8193 -0.8753 0.0016 0.4910 -0.0696 -0.2953 -0.2928 -1.6848 -0.4318 1.1193 2.0989 0.6211 0.3638 0.1002 0.5503 0.7714 0.5975 -3.1964 -0.4338 0.3476 -1.4289 -1.5407 0.6868 1.0198 0.1987 0.1276 -0.3175 -1.8273 0.0457 -1.1576 0.3282 -0.1685 -0.7129 -1.0777 0.7811 0.1593 0.9413 -2.3284 -0.5072 -0.0256 -0.8463 0.4821 0.2637 -1.6375 1.1964 0.4367 -1.2949 -3.2492 -0.9738 -1.5265 -0.6101 0.5196 0.1557 -0.3079 -0.2439 0.1784 -0.6968 -1.9128 1.9896 -1.4410 1.2827 -0.9333 0.2214 -0.0460 2.0425 0.8316 0.2288 -2.6038 -0.7197 1.5676 1.9001 1.4665 0.6101 1.0496 -0.5040 1.4032 -0.1418 -3.4464 -0.2469 0.9019 -0.8877 0.1510 0.9234 0.4323 0.5043 -0.2450 0.1873 -1.6865 1.5494 -0.4617 -0.9208 0.8724 -0.1325 0.8989 -0.7223 0.2492 -0.1926 -1.5631 -0.2084 0.6216 -0.7920 0.3213 -0.2442 0.3770 0.2211 0.5378 0.0152 -2.3364 1.1241 2.0811 0.4693 0.6360 -1.7910 -1.2878 -0.5327 -0.5601 -0.2646 ...... ......
The number 400 indicates that there are 400 vectors.
The number 10 indicates that these are 10 dimensional vectors.
Each line of this file represents the coordinates of an
image vector in the pseudo-Euclidean space.
The first line represents image vector 0,
the second line represents image vector 1, and so on.
The output in the file d400.dist looks like the following:
4.0728 2.8456 3.3502 3.2352 2.7398 2.9889 4.0755 4.1815 2.9879 4.7249 2.8299 3.9779 3.1020 3.5049 4.1980 3.5606 4.5433 3.7960 4.1910 3.8231 3.1871 3.8953 3.9691 3.6289 4.4674 3.3331 3.1036 2.7665 4.1445 5.3213 3.4076 4.2380 4.8977 3.1386 4.5469 4.7804 4.3856 3.2260 3.7689 2.2141 4.9199 4.9404 4.8649 4.6019 5.5956 3.8667 3.9858 2.8405 3.7353 2.1089 3.7809 3.1858 3.0554 4.7936 3.8987 ...... ......
This file contains the pairwise distances of the images. As in d400.20,the entry at the ith row and jth column represents the distance between image vector i (i.e. the image of object i) and image vector j-1 (i.e. the image of object j-1). Again, the vectors are indexed starting from 0, consistent with the object indexing.
So, "4.0728" represents the distance between the image of object 1 and
the image of object 0.
"2.7398" represents the distance between the image of object 3 and the image of object 1.
2. The clustering program Average Group Method:
Average Group Method is a clustering algorithm described in many textbooks, e.g. in L. Kaufman and P. J. Rousseeuw, "Finding Groups in Data: An Introduction to Cluster Analysis", John Wiley & Sons, 1990. Its implementation is as follows.
There is one header file: agm.h
There are three source files: agm.c, agmuti.c, and distance.c
Executable file: agm
To compile:
To run:
Distance_in_file contains the pairwise distances between the objects, N_obj is the total number of the objects, and N_cluster is the number of clusters to be generated.
For example, if we run:
The output looks like the following:
Everything is ready. Initialize heap complete... Clustering complete... Cluster 0: 17 63 45 44 71 83 90 58 69 19 86 95 97 93 80 20 87 47 46 25 10 60 15 79 68 59 64 14 82 92 96 55 33 75 11 91 70 66 37 31 99 21 65 5 34 74 29 73 12 26 51 0 72 57 36 28 9 56 16 4 89 38 50 7 40 23 41 3 67 8 98 39 22 54 43 53 24 2 30 49 52 81 42 76 94 78 32 1 85 6 84 61 77 35 88 13 48 27 18 62 Total: 100 objects. Cluster 1: 286 267 278 244 258 237 251 249 261 266 247 252 257 250 210 227 287 220 289 211 282 285 299 213 268 225 214 294 206 216 224 212 231 245 273 219 238 246 276 232 242 201 270 236 275 204 202 290 207 240 291 215 239 274 203 283 218 281 298 228 205 229 272 292 253 262 241 293 254 264 295 265 284 280 260 223 243 222 248 234 256 200 259 297 230 208 233 217 209 235 255 288 263 279 226 296 269 271 277 221 Total: 100 objects. Cluster 2: 194 196 154 164 173 174 119 159 108 198 121 175 134 120 163 141 191 179 115 195 116 104 185 144 127 131 107 186 189 106 172 155 183 113 101 103 187 123 133 184 128 130 124 111 180 136 150 142 129 138 132 197 158 166 188 117 190 199 167 153 161 140 160 126 114 145 102 125 151 139 182 157 135 137 181 193 118 165 192 156 110 147 146 176 178 112 122 168 149 169 148 152 162 170 177 143 100 109 171 105 Total: 100 objects. Cluster 3: 310 366 378 372 312 346 368 356 380 342 333 348 381 377 302 303 397 391 338 349 341 327 384 313 326 357 371 392 329 367 389 300 374 330 361 385 344 395 322 364 321 324 353 360 396 339 387 332 369 373 307 315 376 323 386 309 388 347 317 336 345 328 383 306 314 399 340 352 375 354 308 394 318 370 390 304 301 359 393 334 351 365 358 320 325 362 379 305 335 316 350 343 355 382 331 319 337 363 311 398 Total: 100 objects.
In each cluster we list object indexes. The data are generated in such
a way that the first cluster includes objects with indexes ranging
from 0 to 99, the second cluster includes objects with indexes ranging
from 100 to 199, the third cluster includes objects with indexes
ranging from 200 to 299, and the fourth cluster includes objects with
indexes ranging from 300 to 399.
We can see from the outputs that AGM achieves perfect clustering.
Average Group Method has the complexity of O(NxN).
Assuming the calculation of distances between the objects is expensive,
applying Average Group Method directly to the data set is not feasible.
Thus we can take a small sample set of m objects and use MetricMap to
map the objects to vectors in the pseudo-Euclidean space in time O(Nxm).
Once the data objects are embedded in the target space, the calculation
of distances between the vectors is much cheaper.
We then apply Average Group Method to the pairwise distances of the vectors
in the pseudo-Euclidean space and cluster them.
For example, to apply Average Group Method directly to the RNA input
data file coxrhi.dist, we execute the following command
at the operating system level:
The output on the screen looks like the following:
Everything is ready. Initialize heap complete... Clustering complete... Cluster 0: 180 116 196 146 155 156 170 183 181 163 108 144 141 167 186 122 145 195 138 177 134 149 123 199 179 190 128 178 119 113 118 131 160 191 162 135 100 150 130 126 173 154 102 104 151 153 148 182 111 109 169 143 172 176 114 107 165 106 194 133 193 136 166 101 192 157 142 137 159 132 103 139 124 129 188 105 171 112 115 174 175 127 187 120 185 189 168 198 110 121 117 184 147 152 164 Total: 95 objects. Cluster 1: 17 18 70 68 26 86 20 22 9 43 29 62 58 64 33 65 76 75 60 27 72 30 32 28 87 97 93 5 4 46 92 99 50 91 90 42 81 3 7 16 55 80 88 45 37 71 73 39 66 23 13 14 56 89 10 52 19 12 69 54 25 47 40 67 53 36 41 35 84 78 49 95 21 15 77 82 1 83 51 8 6 0 48 98 94 96 11 2 57 79 63 61 31 24 59 161 197 140 158 125 44 85 38 74 34 Total: 105 objects.
Suppose we want to take a sample set of 120 RNA secondary structures, some coming from the human rhinovirus family and the other coming from the coxsackievirus family. Map the RNA secondary structures to a 80 dimensional space, and then use Average Group Method to cluster their image vectors in the pseudo-Euclidean space.
We can do this by typing the following commands at the operating system level:
The output on the screen looks like the following:
Everything is ready. Initialize heap complete... Clustering complete... Cluster 0: 115 112 139 188 171 124 129 105 103 168 186 162 116 167 130 169 156 182 101 106 178 111 114 128 109 113 104 108 107 149 123 100 199 135 177 193 134 102 163 131 194 183 184 190 148 144 192 151 180 179 143 146 126 133 150 166 191 138 160 136 189 195 125 187 127 120 145 165 175 140 196 164 137 159 158 161 142 172 157 197 181 173 185 118 119 154 153 132 176 170 155 141 122 174 147 152 110 198 117 121 Total: 100 objects. Cluster 1: 36 41 35 67 53 47 40 77 15 21 59 31 6 48 57 82 2 8 51 83 96 79 63 61 95 1 11 49 0 24 78 84 98 94 44 74 38 85 34 27 76 75 65 60 30 26 68 70 18 17 20 22 86 32 28 87 9 29 72 43 33 62 37 71 73 66 23 39 45 88 42 3 80 7 16 81 55 90 69 54 12 4 50 92 99 91 97 93 46 5 56 13 14 64 58 52 19 89 10 25 Total: 100 objects.
3. The clustering program CURE:
CURE is published in ACM SIGMOD'98 by S. Guha, R. Rastogi, and K. Shim.
Its implementation is as follows.
There is one header file: cure.h
There are three source files: cure.c, cureuti.c, and distance.c
Executable file: cure
To compile:
To run:
Vector_file contains the vectors to be clustered, N_vect is the number of vectors to be clustered, and N_cluster is the number of clusters to be generated.
For example, the file "v400.20" contains 400 randomly generated 20
dimensional vectors.
The file looks like the following:
400 20 -0.1620 0.7580 -0.8870 0.5150 0.0510 0.6270 0.0100 0.4190 -0.7880 -0.9140 -0.2510 -0.2330 0.084 0 -0.9400 -0.7750 0.5430 0.0890 0.1830 0.1370 0.5660 -0.0340 -0.0220 -0.5050 -0.6890 0.3670 -0.9460 0.0310 0.1450 0.8820 0.7360 -0.4760 -0.4950 -0.6 060 -0.8980 -0.1490 0.0670 -0.2460 0.6530 -0.4390 0.0960 0.6280 0.1880 -0.9150 -0.8570 -0.0330 0.4060 -0.8350 0.4030 0.5620 -0.1660 0.3530 -0.0800 -0.55 60 -0.1970 0.9620 0.3180 0.4220 0.3270 -0.5430 0.9450 -0.5210 0.9830 -0.2490 0.8940 -0.3300 -0.7410 -0.7520 0.7570 0.6290 0.3060 -0.3940 0.9900 0.738 0 -0.4840 0.4140 0.2620 0.1160 -0.1750 0.1810 0.1340 0.3430 -0.9780 0.2330 0.5360 0.7600 0.9790 0.0710 0.2010 0.3360 0.0610 -0.8400 -0.9950 -0.2710 0.6440 0.4750 0.6930 0.5140 -0.8610 -0.9120 -0.4790 0.2020 0.1710 -0.5660 -0.6830 -0.4180 -0.1850 -0.4140 0.6530 -0.6940 0.1740 -0.5490 0.4480 -0.5 270 -0.5660 -0.8070 -0.8900 -0.2520 -0.7900 0.3200 -0.9510 -0.0440 -0.8380 -0.8340 -0.0030 0.7930 -0.6900 0.3910 0.7990 0.9260 -0.0950 0.8850 -0.4180 0.61 ...... ......
The number 400 indicates that there are 400 vectors.
The number 20 indicates that these are 20 dimensional vectors.
To apply CURE to v400.20 we run:
The output on the screen looks like the following:
Everything is ready. Initialize heap complete... Clustering complete... Cluster 0: 263 286 267 251 249 222 202 289 299 294 219 274 203 230 262 258 254 253 233 293 237 250 232 226 287 220 255 207 291 223 239 271 206 259 296 269 245 231 240 236 276 290 213 297 208 209 217 270 205 229 200 201 242 234 256 248 225 268 292 235 288 278 244 252 261 238 246 212 224 277 221 285 210 243 275 216 265 295 228 215 218 281 257 204 298 211 227 282 279 264 272 241 266 284 280 247 214 260 283 273 Total: 100 objects. Cluster 1: 196 150 147 198 166 185 108 161 102 151 174 194 125 103 176 130 162 128 126 120 178 146 139 135 157 160 177 199 190 148 152 170 107 186 119 189 106 172 155 113 173 101 100 149 121 175 115 191 129 138 114 134 136 180 158 197 111 124 143 118 193 183 104 127 144 179 171 105 109 112 122 165 156 110 192 154 164 184 159 133 131 168 141 181 153 167 163 132 169 195 116 182 137 117 142 188 140 187 123 145 Total: 100 objects. Cluster 2: 27 85 30 66 31 54 58 10 9 3 11 24 53 36 21 76 94 17 37 71 0 56 47 22 15 95 20 87 73 68 59 78 32 1 91 7 70 74 34 5 65 33 75 79 18 42 81 52 45 98 40 50 44 62 93 84 61 41 25 46 48 77 72 57 38 4 16 29 12 26 2 99 97 83 64 14 82 92 28 96 55 86 89 19 69 39 23 60 90 35 6 63 80 49 88 13 67 8 43 51 Total: 100 objects. Cluster 3: 302 318 326 352 397 300 346 395 341 391 303 337 363 379 366 310 329 362 325 374 305 353 385 321 367 347 335 323 317 376 307 332 373 312 351 365 349 334 399 340 336 306 350 370 328 393 304 343 301 359 342 390 375 345 316 330 364 354 394 378 348 333 320 339 396 387 388 386 309 356 322 389 398 308 371 313 319 382 355 361 377 381 392 357 372 324 315 369 383 368 327 384 380 314 358 338 331 360 311 344 Total: 100 objects.
Like AGM, CURE achieves a perfect clustering on these data.
We may map these 20 dimensional vectors to a 5 dimensional
pseudo-Euclidean space by using MetricMap
through the following commands:
Notice that d400.20 contains the pairwise distances of the vectors in v400.20. The data in d400.vect now looks like the following:
400 5 0.0000 0.0000 0.0000 0.0000 0.0000 -2.9230 -0.1298 1.3652 -0.7308 0.9252 -2.6179 -0.3876 1.0235 -0.1215 -0.4779 -1.9882 0.5153 0.2969 0.0072 -0.4473 -2.5031 -1.5796 -0.5920 -1.5827 0.5528 -2.0708 1.2086 0.6483 -1.1900 -0.4187 -2.7634 0.2079 1.0544 0.8281 0.9377 -2.9396 0.4679 -0.1456 -0.2773 0.1866 -2.5543 1.1021 -0.5163 -0.5840 -0.2894 -2.7844 -0.1489 -0.1641 2.1055 0.0639 ...... ......
We may then apply CURE to the 5-dimensional vectors in d400.vect by executing the following command at the operating system level:
The output on the screen looks like the following:
Everything is ready. Initialize heap complete... Clustering complete... Cluster 0: 220 285 224 211 286 287 204 202 290 299 27 203 282 267 244 270 229 263 276 248 246 228 298 279 232 231 219 200 225 259 241 206 294 222 218 288 271 245 262 281 235 280 215 264 250 239 269 247 293 261 255 252 266 205 240 236 230 212 207 292 242 257 295 291 265 213 217 209 201 278 233 223 260 216 254 289 284 214 238 275 296 283 268 253 256 234 208 297 258 249 221 277 237 272 210 226 243 251 274 227 273 Total: 101 objects. Cluster 1: 51 57 58 74 5 65 11 10 28 52 29 12 56 41 25 68 6 46 87 60 88 9 92 38 76 81 42 82 44 79 62 71 77 35 84 34 61 98 40 50 8 85 3 7 15 66 99 90 86 69 72 24 33 70 20 64 39 16 53 32 1 47 26 37 73 55 2 22 91 36 13 95 14 75 48 59 83 67 96 54 43 30 49 17 97 21 19 80 45 18 93 4 89 0 31 23 94 63 Total: 98 objects. Cluster 2: 353 324 385 362 355 382 309 386 351 305 389 301 343 354 341 333 342 381 307 379 364 374 390 388 316 350 373 370 347 329 387 318 345 339 304 360 356 393 321 327 320 348 359 394 378 335 332 310 367 375 380 371 369 361 322 315 308 396 398 323 399 336 358 372 352 303 376 317 344 311 312 330 326 313 338 397 383 328 392 391 306 366 331 395 325 300 340 384 314 334 319 349 365 377 363 346 368 357 302 337 78 Total: 101 objects. Cluster 3: 158 187 122 163 103 154 181 166 108 101 196 175 182 113 106 188 168 185 178 109 105 184 190 130 140 128 161 174 162 170 120 156 189 191 110 132 167 172 124 133 116 138 127 195 123 157 171 126 121 100 135 115 102 177 104 143 160 129 150 119 192 179 139 176 107 148 180 149 151 142 183 144 125 117 169 173 155 199 114 194 111 136 186 197 141 134 152 153 118 193 159 131 112 137 145 147 165 146 198 164 Total: 100 objects.
Notice that two objects, object 27 and 78, are mis-clustered.
The mis-clustered rate is 0.5%.
CURE is a vector based clustering algorithm, which can not be applied to
general distance metric directly.
We used MetricMap to map objects to vectors so that CURE can be used to cluster those objects.
4. The best match search and range search program using MetricMap:
There are no header files.
There are 4 source files: distance.c, vector.c, uti1.c, and msearch.c.
Executable file: msearch
To compile:
To run:
or
Given a target object and a database of objects, "msearch" performs best
match retrieval with the "-b" option, or epsilon range search with
the "-r" option.
For the best match retrieval, the msearch tool finds the object in the
database that is closest to the target.
If there are more than one best match, the tool returns all the
best-matching objects.
For the epsilon range search, the msearch tool finds the objects in the
database whose distances to the target are within the given epsilon value.
Here we assume all the database objects have been embedded in a
pseudo-Euclidean space using MetricMap.
The indexes of the reference objects are stored in a file called "b.ind"
(cf. the MetricMap tool).
The image vectors of the objects are stored in a file called "v.base".
We give two annotated examples of sessions of the msearch system, one for best match retrieval and the other for epsilon range search.
Refer to the MetricMap system.
Suppose we have a database of 400 20-dimensional vectors, indexed from 0
to 399. These vectors (objects) are stored in file v400.20.
The pairwise distances of the objects are stored in file d400.20.
We can embed those objects in a 10-dimensional pseudo-Eucliden space by
executing the following commands:
Recall that the "femb" program generates two files, "b.ind" and "v.base". The file "b.ind" contains the indexes of the reference objects, which looks like the following:
9 25 36 28 10 37 14 31 39 32
This means that object 9, object 25, ... object 32 are used as reference objects to build the target space.
The file "v.base" contains the image vectors of the objects, which looks like the following:
400 10 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 -2.5781 -2.1680 0.5737 2.0081 -0.7672 -0.0987 0.2544 -0.3053 0.3483 0.0392 -2.3080 0.8428 0.4520 0.8193 -0.8753 0.0016 0.4910 -0.0696 -0.2953 -0.2928 -1.6848 -0.4318 1.1193 2.0989 0.6211 0.3638 0.1002 0.5503 0.7714 0.5975 -3.1964 -0.4338 0.3476 -1.4289 -1.5407 0.6868 1.0198 0.1987 0.1276 -0.3175 -1.8273 0.0457 -1.1576 0.3282 -0.1685 -0.7129 -1.0777 0.7811 0.1593 0.9413 -2.3284 -0.5072 -0.0256 -0.8463 0.4821 0.2637 -1.6375 1.1964 0.4367 -1.2949 -3.2492 -0.9738 -1.5265 -0.6101 0.5196 0.1557 -0.3079 -0.2439 0.1784 -0.6968 -1.9128 1.9896 -1.4410 1.2827 -0.9333 0.2214 -0.0460 2.0425 0.8316 0.2288 -2.6038 -0.7197 1.5676 1.9001 1.4665 0.6101 1.0496 -0.5040 1.4032 -0.1418 -3.4464 -0.2469 0.9019 -0.8877 0.1510 0.9234 0.4323 0.5043 -0.2450 0.1873 -1.6865 1.5494 -0.4617 -0.9208 0.8724 -0.1325 0.8989 -0.7223 0.2492 -0.1926 -1.5631 -0.2084 0.6216 -0.7920 0.3213 -0.2442 0.3770 0.2211 0.5378 0.0152 -2.3364 1.1241 2.0811 0.4693 0.6360 -1.7910 -1.2878 -0.5327 -0.5601 -0.2646 ...... ......
The number 400 indicates that there are 400 vectors.
The number 10 indicates that these are 10-dimensional vectors.
Each line of this file represents the coordinates of an image
vector in the pseudo-Euclidean space.
Now suppose we are given a target object in a file called "target".
For example, the data in the target file looks like the following:
-2.9375 -1.8045 -2.4635 -2.8335 -1.5975 -1.9245 -1.6905 -2.5165 -2.7335 -1.4645 0.9480 -0.0230 0.5400 -0.3580 -0.0180 0.0440 0.7980 -0.7720 0.0490 -0.8840
To find the best match of the target, type the command:
The output on the screen looks like the following:
The nearest neighbor of the target is object with the index 111.
Basically, the msearch tool first computes the distances from the target to the reference objects stored in the file b.ind. Based on these distances, the tool is able to embed the target in the pseudo-Euclidean space. The tool then finds best matches in the pseudo-Euclidean space.
For epsilon range search, suppose we want to find objects that are within distance 2 of the target object stored in the file "target". One may type in the following command:
The output looks like the following:
The data objects that are within distance 2 of the target have indexes 111, 124, 158, 180.
To speed up search in the target space, we have implemented the VA-file technique introduced by R. Webber et. al. in a paper entitled "A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces" which appeared in the Proc. of VLDB'98.