MetricMap: Searching and Clustering in Metric Spaces

Jason T. L. Wang
Department of Computer Science
College of Computing Sciences
New Jersey Institute of Technology
jason@cis.njit.edu

Xiong Wang
Department of Computer Science
California State University, Fullerton
wang@ecs.fullerton.edu

Dennis Shasha
Department of Computer Science
Courant Institute of Mathematical Sciences
New York University
shasha@cs.nyu.edu

Kaizhong Zhang
Department of Computer Science
University of Western Ontario, Canada
kzhang@csd.uwo.ca

There are four tools in this package, which is part of the ASES project. Click here for the details of the project.

1. The MetricMap system:

There are 2 header files:

nr1.h
main.h

There are 14 source files:

angle1.c, ind1.c, lud1.c, ortho1.c, qli1.c,

tri1.c, uti1.c, eigw.c, fbilinw.c, fchow.c,

fheadw.c, prow.c, repeatw.c, project.c

Executable files: femb and proj

Sample input data files: d400.20 and coxrhi.dist

"d400.20" contains the pairwise distances of 400 randomly generated 20 dimensional vectors, which form 4 clusters.

"coxrhi.dist" contains the pairwise distances of 200 RNA secondary structures, where 100 RNA secondary structures are pertaining to the human rhinovirus and the other 100 RNA secondary structures are pertaining to the coxsackievirus. Thus, these RNA secondary structures roughly form 2 clusters, one corresponding to the human rhinovirus and the other corresponding to the coxsackievirus. These sample data files may be used to replace the parameter Distance_in_file in the commands listed below.

To compile:

make -f Makelib
make femb
cc -o proj project.c -lm

To run:

femb Sample_Size Dimen < Distance_in_file
proj Distance_in_file Nobj Dimen Vect-out-file Dist-out-file

"femb" establishes a Dimen dimensional pseudo-Euclidean space based on the sampling data set.
It generates 3 files: b.ind, b.val, and gram.mat.
b.ind contains the indices of the objects that are chosen to be the base of the space.
b.val contains the eigenvalues of the bilinear form that correspond to the chosen objects.
gram.mat is the gram matrix that is used to project objects to the pseudo-Euclidean space.

"proj" projects the data objects to the pseudo-Euclidean space.
Distance_in_file contains the pairwise distances of data objects,
Sample_Size is the size of the sample set,
Dimen is the dimensionality of the target space,
Nobj is the total number of data objects,
Vect_out_file contains the image vectors of the objects in the pseudo-Euclidean space, and
Dist-out-file contains the pairwise distances of the image vectors.

We give an annotated example of a session of the MetricMap system.
We assume there is a file d400.20 which contains the pairwise distances of 4 clusters of randomly generated vectors in 20 dimensional Euclidean space.
There are 100 vectors in each cluster.
The file looks like the following:

3.5331
3.3376 2.8868
3.3015 3.3660 3.5378
4.0033 3.8468 3.6920 4.3646
3.2924 3.2653 3.8241 3.5259 4.2307
3.9440 3.3932 4.3023 3.9482 4.0962 3.5242
3.8446 3.5387 4.4129 3.7491 3.3720 3.1809 3.1663
3.7694 4.0969 3.2754 3.7708 4.4286 3.3644 4.7370 4.0737
3.6367 3.6421 3.6069 2.8791 4.0440 4.4357 4.0226 3.7571 4.2270
.......
.......

We index the objects, starting from 0.
Strictly speaking, the pairwise distances of N objects is an NxN matrix.
Since the matrix is symmetric, i.e. d(i,j)= d(j,i), and d(i,i)= 0 for all i, we store only that part which is below the diagonal line of the matrix, i.e. j < i.
In the above table, an entry at the ith row and jth column represents the distance between object i and object j-1.
So, "3.5331" represents the distance between object 1 and object 0.
"3.3660" represents the distance between object 3 and object 1.
We list objects based on clusters.
Thus cluster 1 contains objects with indexes ranging from 0 to 99; cluster 2 contains objects with indexes ranging from 100 to 199, etc.

To map these vectors to a 10 dimensional pseudo-Euclidean space, assume a sampling set of 20 vectors (here we assume a random sampling).

We first run:

femb 40 10 < d400.20

The file b.ind looks like the following

9 25 36 28 10 37 14 31 39 32

This means that object 9, object 25, ..., object 32 are chosen as the reference objects.
These reference objects are used to build the target space.
The file b.val looks like the following:

244.295985815851878      30.840193003299831      27.331134046091876     
22.693324721447667
 21.848601030670395      19.968715757559004      17.760763670010231     
15.492304704396824
 13.657936946677481      11.905276204308640

Here b.val contains the eigenvalues corresponding to the reference objects.
For example, "244.295985815851878" is the eigenvalue corresponding to object 9; "30.840193003299831" is the eigenvalue corresponding to object 25, and so on.

The file gram.mat looks like the following:


0.115574453235548 -0.051501465876904 -0.045854384778602 -0.073070673365977
-0.009605172001804
-0.112828982820657  0.088720024289652 -0.099648201210417 -0.083911615757602
-0.119962106364417
 0.130328716802373 -0.342840761030932  0.303066694308480  0.339452617555608
-0.381437909747193
-0.405149430555989  0.207206955370795  0.007917643375473  0.073360446342078
-0.272648797992660
 0.161125014894303 -0.081818856765714 -0.090949331640256 -0.382668529714077 
0.362174216869933
 0.011536665348883  0.352833178734573  0.037090889714356 -0.143451180700862
-0.173681276125141
 0.517326453798527  0.143421668173576 -0.333934112168199 -0.695769740733675 
0.710188476084372
 0.183824076960314  0.447244519144627 -0.086988258971959 -0.528231413528893 
0.206261348445111
-0.216989958513510 -0.056761272336272  0.054511518044404  0.357906030906817 
0.039417076254325
 0.047611730984748 -0.042798387337480 -0.161467640610151 -0.063010615954813 
0.042618497834656
-0.122629060402885 -0.170704841438197  0.272538934066024  0.106233770116270 
0.066302544980609
 0.054218174179443 -0.265046095403686  0.136857710174841 -0.077837156218892
-0.089320609615473
-0.160263159664019 -0.045435732313140 -0.204640412776988  0.097376699480695 
0.263829774762633
-0.015872668103288 -0.429466876738876  0.389463467793250 -0.056196631019379
-0.040543029498501
 0.349560943260916  0.193815613793358 -0.146025738215632  0.165859971660814
-0.168572992954508
 0.044216920551938 -0.187463800068824 -0.193002937790857  0.253110146859819
-0.431229731847577
 0.225181054224089  0.075061105215083 -0.171809659314470 -0.277591329025099 
0.227801981190053
-0.104634551945476  0.055533353905625 -0.228939392394629  0.219356238431454 
0.057341962635193
 0.068159572817717 -0.235779559410695 -0.290390424430416 -0.083056342690215 
0.058481591986641
 0.355855377089496  0.003283594117529  0.008180944815393  0.008041381644906 
0.176734980138826

Here, gram.mat is the Gram matrix used in the transformation when embedding the objects in the pseudo-Euclidean space.

We then run:

proj d400.20 400 10 d400.vect d400.dist

Here we want to project 400 objects, whose pairwise distances are stored in file d400.20, to a 10 dimensional pseudo-Euclidean space. The output includes the vectors (d400.vect), which are the images of those objects in the pseudo-Euclidean space, and the pairwise distances (d400.dist) of those vectors.

The output in the file d400.vect looks like the following:

400 10
  0.0000   0.0000   0.0000   0.0000   0.0000   0.0000   0.0000   0.0000  
0.0000   0.0000
 -2.5781  -2.1680   0.5737   2.0081  -0.7672  -0.0987   0.2544  -0.3053  
0.3483   0.0392
 -2.3080   0.8428   0.4520   0.8193  -0.8753   0.0016   0.4910  -0.0696 
-0.2953  -0.2928
 -1.6848  -0.4318   1.1193   2.0989   0.6211   0.3638   0.1002   0.5503  
0.7714   0.5975
 -3.1964  -0.4338   0.3476  -1.4289  -1.5407   0.6868   1.0198   0.1987  
0.1276  -0.3175
 -1.8273   0.0457  -1.1576   0.3282  -0.1685  -0.7129  -1.0777   0.7811  
0.1593   0.9413
 -2.3284  -0.5072  -0.0256  -0.8463   0.4821   0.2637  -1.6375   1.1964  
0.4367  -1.2949
 -3.2492  -0.9738  -1.5265  -0.6101   0.5196   0.1557  -0.3079  -0.2439  
0.1784  -0.6968
 -1.9128   1.9896  -1.4410   1.2827  -0.9333   0.2214  -0.0460   2.0425  
0.8316   0.2288
 -2.6038  -0.7197   1.5676   1.9001   1.4665   0.6101   1.0496  -0.5040  
1.4032  -0.1418
 -3.4464  -0.2469   0.9019  -0.8877   0.1510   0.9234   0.4323   0.5043 
-0.2450   0.1873
 -1.6865   1.5494  -0.4617  -0.9208   0.8724  -0.1325   0.8989  -0.7223  
0.2492  -0.1926
 -1.5631  -0.2084   0.6216  -0.7920   0.3213  -0.2442   0.3770   0.2211  
0.5378   0.0152
 -2.3364   1.1241   2.0811   0.4693   0.6360  -1.7910  -1.2878  -0.5327 
-0.5601  -0.2646

......
......

The number 400 indicates that there are 400 vectors.
The number 10 indicates that these are 10 dimensional vectors.
Each line of this file represents the coordinates of an image vector in the pseudo-Euclidean space.
The first line represents image vector 0, the second line represents image vector 1, and so on.
The output in the file d400.dist looks like the following:

  4.0728
  2.8456   3.3502
  3.2352   2.7398   2.9889
  4.0755   4.1815   2.9879   4.7249
  2.8299   3.9779   3.1020   3.5049   4.1980
  3.5606   4.5433   3.7960   4.1910   3.8231   3.1871
  3.8953   3.9691   3.6289   4.4674   3.3331   3.1036   2.7665
  4.1445   5.3213   3.4076   4.2380   4.8977   3.1386   4.5469   4.7804
  4.3856   3.2260   3.7689   2.2141   4.9199   4.9404   4.8649   4.6019  5.5956
  3.8667   3.9858   2.8405   3.7353   2.1089   3.7809   3.1858   3.0554  4.7936   3.8987
  ......
  ......

This file contains the pairwise distances of the images. As in d400.20,the entry at the ith row and jth column represents the distance between image vector i (i.e. the image of object i) and image vector j-1 (i.e. the image of object j-1). Again, the vectors are indexed starting from 0, consistent with the object indexing.

So, "4.0728" represents the distance between the image of object 1 and the image of object 0.
"2.7398" represents the distance between the image of object 3 and the image of object 1.

2. The clustering program Average Group Method:

Average Group Method is a clustering algorithm described in many textbooks, e.g. in L. Kaufman and P. J. Rousseeuw, "Finding Groups in Data: An Introduction to Cluster Analysis", John Wiley & Sons, 1990. Its implementation is as follows.

There is one header file: agm.h

There are three source files: agm.c, agmuti.c, and distance.c

Executable file: agm

To compile:

cc -o agm agm.c

To run:

agm Distance_in_file N_obj N_cluster

Distance_in_file contains the pairwise distances between the objects, N_obj is the total number of the objects, and N_cluster is the number of clusters to be generated.

For example, if we run:

agm d400.20 400 4

The output looks like the following:

Everything is ready.

Initialize heap complete...

Clustering complete...

Cluster 0:

 17  63  45  44  71  83  90  58  69  19  86  95  97  93  80  20  87  47  46 25
 10  60  15  79  68  59  64  14  82  92  96  55  33  75  11  91  70  66  37 31
 99  21  65   5  34  74  29  73  12  26  51   0  72  57  36  28   9  56  16 4
 89  38  50   7  40  23  41   3  67   8  98  39  22  54  43  53  24   2  30 49
 52  81  42  76  94  78  32   1  85   6  84  61  77  35  88  13  48  27  18 62

Total: 100 objects.

Cluster 1:

286 267 278 244 258 237 251 249 261 266 247 252 257 250 210 227 287 220 289 211
282 285 299 213 268 225 214 294 206 216 224 212 231 245 273 219 238 246 276 232
242 201 270 236 275 204 202 290 207 240 291 215 239 274 203 283 218 281 298 228
205 229 272 292 253 262 241 293 254 264 295 265 284 280 260 223 243 222 248 234
256 200 259 297 230 208 233 217 209 235 255 288 263 279 226 296 269 271 277 221

Total: 100 objects.

Cluster 2:

194 196 154 164 173 174 119 159 108 198 121 175 134 120 163 141 191 179 115 195
116 104 185 144 127 131 107 186 189 106 172 155 183 113 101 103 187 123 133 184
128 130 124 111 180 136 150 142 129 138 132 197 158 166 188 117 190 199 167 153
161 140 160 126 114 145 102 125 151 139 182 157 135 137 181 193 118 165 192 156
110 147 146 176 178 112 122 168 149 169 148 152 162 170 177 143 100 109 171 105

Total: 100 objects.

Cluster 3:

310 366 378 372 312 346 368 356 380 342 333 348 381 377 302 303 397 391 338 349
341 327 384 313 326 357 371 392 329 367 389 300 374 330 361 385 344 395 322 364
321 324 353 360 396 339 387 332 369 373 307 315 376 323 386 309 388 347 317 336
345 328 383 306 314 399 340 352 375 354 308 394 318 370 390 304 301 359 393 334
351 365 358 320 325 362 379 305 335 316 350 343 355 382 331 319 337 363 311 398

Total: 100 objects.

In each cluster we list object indexes. The data are generated in such a way that the first cluster includes objects with indexes ranging from 0 to 99, the second cluster includes objects with indexes ranging from 100 to 199, the third cluster includes objects with indexes ranging from 200 to 299, and the fourth cluster includes objects with indexes ranging from 300 to 399.
We can see from the outputs that AGM achieves perfect clustering.
Average Group Method has the complexity of O(NxN).
Assuming the calculation of distances between the objects is expensive, applying Average Group Method directly to the data set is not feasible. Thus we can take a small sample set of m objects and use MetricMap to map the objects to vectors in the pseudo-Euclidean space in time O(Nxm).
Once the data objects are embedded in the target space, the calculation of distances between the vectors is much cheaper.
We then apply Average Group Method to the pairwise distances of the vectors in the pseudo-Euclidean space and cluster them.

For example, to apply Average Group Method directly to the RNA input data file coxrhi.dist, we execute the following command at the operating system level:

agm coxrhi.dist 200 2

The output on the screen looks like the following:

Everything is ready.

Initialize heap complete...

Clustering complete...

Cluster 0:

180 116 196 146 155 156 170 183 181 163 108 144 141 167 186 122 145 195 138 177
134 149 123 199 179 190 128 178 119 113 118 131 160 191 162 135 100 150 130 126
173 154 102 104 151 153 148 182 111 109 169 143 172 176 114 107 165 106 194 133
193 136 166 101 192 157 142 137 159 132 103 139 124 129 188 105 171 112 115 174
175 127 187 120 185 189 168 198 110 121 117 184 147 152 164 

Total: 95 objects.

Cluster 1:

17  18  70  68  26  86  20  22   9  43  29  62  58  64  33  65  76  75  60 27
72  30  32  28  87  97  93   5   4  46  92  99  50  91  90  42  81   3   7 16
55  80  88  45  37  71  73  39  66  23  13  14  56  89  10  52  19  12  69 54
25  47  40  67  53  36  41  35  84  78  49  95  21  15  77  82   1  83  51 8
 6   0  48  98  94  96  11   2  57  79  63  61  31  24  59 161 197 140 158 125
44  85  38  74  34 

Total: 105 objects.

Suppose we want to take a sample set of 120 RNA secondary structures, some coming from the human rhinovirus family and the other coming from the coxsackievirus family. Map the RNA secondary structures to a 80 dimensional space, and then use Average Group Method to cluster their image vectors in the pseudo-Euclidean space.

We can do this by typing the following commands at the operating system level:

femb 120 80 < coxrhi.dist
proj coxrhi.dist 200 80 cr.vect cr.dist
agm cr.dist 200 2

The output on the screen looks like the following:

Everything is ready.

Initialize heap complete...

Clustering complete...

Cluster 0:

115 112 139 188 171 124 129 105 103 168 186 162 116 167 130 169 156 182 101 106
178 111 114 128 109 113 104 108 107 149 123 100 199 135 177 193 134 102 163 131
194 183 184 190 148 144 192 151 180 179 143 146 126 133 150 166 191 138 160 136
189 195 125 187 127 120 145 165 175 140 196 164 137 159 158 161 142 172 157 197
181 173 185 118 119 154 153 132 176 170 155 141 122 174 147 152 110 198 117 121

Total: 100 objects.

Cluster 1:

36  41  35  67  53  47  40  77  15  21  59  31   6  48  57  82   2   8  51 83
96  79  63  61  95   1  11  49   0  24  78  84  98  94  44  74  38  85  34 27
76  75  65  60  30  26  68  70  18  17  20  22  86  32  28  87   9  29  72 43
33  62  37  71  73  66  23  39  45  88  42   3  80   7  16  81  55  90  69 54
12   4  50  92  99  91  97  93  46   5  56  13  14  64  58  52  19  89  10 25

Total: 100 objects.

3. The clustering program CURE:

CURE is published in ACM SIGMOD'98 by S. Guha, R. Rastogi, and K. Shim.

Its implementation is as follows.

There is one header file: cure.h

There are three source files: cure.c, cureuti.c, and distance.c

Executable file: cure

To compile:

cc -o cure cure.c

To run:

cure Vector_file N_vect N_cluster

Vector_file contains the vectors to be clustered, N_vect is the number of vectors to be clustered, and N_cluster is the number of clusters to be generated.

For example, the file "v400.20" contains 400 randomly generated 20 dimensional vectors.
The file looks like the following:

400 20

-0.1620 0.7580 -0.8870 0.5150 0.0510 0.6270 0.0100 0.4190 -0.7880 -0.9140
-0.2510 -0.2330 0.084
0 -0.9400 -0.7750 0.5430 0.0890 0.1830 0.1370 0.5660
-0.0340 -0.0220 -0.5050 -0.6890 0.3670 -0.9460 0.0310 0.1450 0.8820 0.7360
-0.4760 -0.4950 -0.6
060 -0.8980 -0.1490 0.0670 -0.2460 0.6530 -0.4390 0.0960
0.6280 0.1880 -0.9150 -0.8570 -0.0330 0.4060 -0.8350 0.4030 0.5620 -0.1660
0.3530 -0.0800 -0.55
60 -0.1970 0.9620 0.3180 0.4220 0.3270 -0.5430 0.9450
-0.5210 0.9830 -0.2490 0.8940 -0.3300 -0.7410 -0.7520 0.7570 0.6290 0.3060
-0.3940 0.9900 0.738
0 -0.4840 0.4140 0.2620 0.1160 -0.1750 0.1810 0.1340
0.3430 -0.9780 0.2330 0.5360 0.7600 0.9790 0.0710 0.2010 0.3360 0.0610
-0.8400 -0.9950 -0.2710
0.6440 0.4750 0.6930 0.5140 -0.8610 -0.9120 -0.4790
0.2020 0.1710 -0.5660 -0.6830 -0.4180 -0.1850 -0.4140 0.6530 -0.6940 0.1740
-0.5490 0.4480 -0.5
270 -0.5660 -0.8070 -0.8900 -0.2520 -0.7900 0.3200 -0.9510
-0.0440 -0.8380 -0.8340 -0.0030 0.7930 -0.6900 0.3910 0.7990 0.9260 -0.0950
0.8850 -0.4180 0.61
......
......

The number 400 indicates that there are 400 vectors.
The number 20 indicates that these are 20 dimensional vectors.
To apply CURE to v400.20 we run:

cure v400.20 400 4

The output on the screen looks like the following:

Everything is ready.

Initialize heap complete...

Clustering complete...

Cluster 0:

263 286 267 251 249 222 202 289 299 294 219 274 203 230 262 258 254 253 233 293
237 250 232 226 287 220 255 207 291 223 239 271 206 259 296 269 245 231 240 236
276 290 213 297 208 209 217 270 205 229 200 201 242 234 256 248 225 268 292 235
288 278 244 252 261 238 246 212 224 277 221 285 210 243 275 216 265 295 228 215
218 281 257 204 298 211 227 282 279 264 272 241 266 284 280 247 214 260 283 273

Total: 100 objects.

Cluster 1:

196 150 147 198 166 185 108 161 102 151 174 194 125 103 176 130 162 128 126 120
178 146 139 135 157 160 177 199 190 148 152 170 107 186 119 189 106 172 155 113
173 101 100 149 121 175 115 191 129 138 114 134 136 180 158 197 111 124 143 118
193 183 104 127 144 179 171 105 109 112 122 165 156 110 192 154 164 184 159 133
131 168 141 181 153 167 163 132 169 195 116 182 137 117 142 188 140 187 123 145

Total: 100 objects.

Cluster 2:

 27  85  30  66  31  54  58  10   9   3  11  24  53  36  21  76  94  17  37 71
  0  56  47  22  15  95  20  87  73  68  59  78  32   1  91   7  70  74  34  5
 65  33  75  79  18  42  81  52  45  98  40  50  44  62  93  84  61  41  25 46
 48  77  72  57  38   4  16  29  12  26   2  99  97  83  64  14  82  92  28 96
 55  86  89  19  69  39  23  60  90  35   6  63  80  49  88  13  67   8  43 51

Total: 100 objects.

Cluster 3:

302 318 326 352 397 300 346 395 341 391 303 337 363 379 366 310 329 362 325 374
305 353 385 321 367 347 335 323 317 376 307 332 373 312 351 365 349 334 399 340
336 306 350 370 328 393 304 343 301 359 342 390 375 345 316 330 364 354 394 378
348 333 320 339 396 387 388 386 309 356 322 389 398 308 371 313 319 382 355 361
377 381 392 357 372 324 315 369 383 368 327 384 380 314 358 338 331 360 311 344

Total: 100 objects.

Like AGM, CURE achieves a perfect clustering on these data.
We may map these 20 dimensional vectors to a 5 dimensional pseudo-Euclidean space by using MetricMap through the following commands:

femb 100 5 < d400.20
proj d400.20 400 5 d400.vect d400.dist

Notice that d400.20 contains the pairwise distances of the vectors in v400.20. The data in d400.vect now looks like the following:

400 5
  0.0000   0.0000   0.0000   0.0000   0.0000
 -2.9230  -0.1298   1.3652  -0.7308   0.9252
 -2.6179  -0.3876   1.0235  -0.1215  -0.4779
 -1.9882   0.5153   0.2969   0.0072  -0.4473
 -2.5031  -1.5796  -0.5920  -1.5827   0.5528
 -2.0708   1.2086   0.6483  -1.1900  -0.4187
 -2.7634   0.2079   1.0544   0.8281   0.9377
 -2.9396   0.4679  -0.1456  -0.2773   0.1866
 -2.5543   1.1021  -0.5163  -0.5840  -0.2894
 -2.7844  -0.1489  -0.1641   2.1055   0.0639
......
......

We may then apply CURE to the 5-dimensional vectors in d400.vect by executing the following command at the operating system level:

cure d400.vect 400 4

The output on the screen looks like the following:


Everything is ready.

Initialize heap complete...

Clustering complete...

Cluster 0:

220 285 224 211 286 287 204 202 290 299  27 203 282 267 244 270 229 263 276 248
246 228 298 279 232 231 219 200 225 259 241 206 294 222 218 288 271 245 262 281
235 280 215 264 250 239 269 247 293 261 255 252 266 205 240 236 230 212 207 292
242 257 295 291 265 213 217 209 201 278 233 223 260 216 254 289 284 214 238 275
296 283 268 253 256 234 208 297 258 249 221 277 237 272 210 226 243 251 274 227
273 

Total: 101 objects.

Cluster 1:

 51  57  58  74   5  65  11  10  28  52  29  12  56  41  25  68   6  46  87 60
 88   9  92  38  76  81  42  82  44  79  62  71  77  35  84  34  61  98  40 50
  8  85   3   7  15  66  99  90  86  69  72  24  33  70  20  64  39  16  53 32
  1  47  26  37  73  55   2  22  91  36  13  95  14  75  48  59  83  67  96 54
 43  30  49  17  97  21  19  80  45  18  93   4  89   0  31  23  94  63 

Total: 98 objects.

Cluster 2:

353 324 385 362 355 382 309 386 351 305 389 301 343 354 341 333 342 381 307 379
364 374 390 388 316 350 373 370 347 329 387 318 345 339 304 360 356 393 321 327
320 348 359 394 378 335 332 310 367 375 380 371 369 361 322 315 308 396 398 323
399 336 358 372 352 303 376 317 344 311 312 330 326 313 338 397 383 328 392 391
306 366 331 395 325 300 340 384 314 334 319 349 365 377 363 346 368 357 302 337
 78 

Total: 101 objects.

Cluster 3:

158 187 122 163 103 154 181 166 108 101 196 175 182 113 106 188 168 185 178 109
105 184 190 130 140 128 161 174 162 170 120 156 189 191 110 132 167 172 124 133
116 138 127 195 123 157 171 126 121 100 135 115 102 177 104 143 160 129 150 119
192 179 139 176 107 148 180 149 151 142 183 144 125 117 169 173 155 199 114 194
111 136 186 197 141 134 152 153 118 193 159 131 112 137 145 147 165 146 198 164

Total: 100 objects.

Notice that two objects, object 27 and 78, are mis-clustered.
The mis-clustered rate is 0.5%.
CURE is a vector based clustering algorithm, which can not be applied to general distance metric directly.
We used MetricMap to map objects to vectors so that CURE can be used to cluster those objects.

4. The best match search and range search program using MetricMap:

There are no header files.

There are 4 source files: distance.c, vector.c, uti1.c, and msearch.c.

Executable file: msearch

To compile:

cc -o msearch msearch.c -lm

To run:

msearch -b Target-object

msearch -r Target-object Epsilon-value

Given a target object and a database of objects, "msearch" performs best match retrieval with the "-b" option, or epsilon range search with the "-r" option.
For the best match retrieval, the msearch tool finds the object in the database that is closest to the target.
If there are more than one best match, the tool returns all the best-matching objects.
For the epsilon range search, the msearch tool finds the objects in the database whose distances to the target are within the given epsilon value.

Here we assume all the database objects have been embedded in a pseudo-Euclidean space using MetricMap.
The indexes of the reference objects are stored in a file called "b.ind" (cf. the MetricMap tool).
The image vectors of the objects are stored in a file called "v.base".

We give two annotated examples of sessions of the msearch system, one for best match retrieval and the other for epsilon range search.

Refer to the MetricMap system.
Suppose we have a database of 400 20-dimensional vectors, indexed from 0 to 399. These vectors (objects) are stored in file v400.20.
The pairwise distances of the objects are stored in file d400.20.
We can embed those objects in a 10-dimensional pseudo-Eucliden space by executing the following commands:

femb 40 10 < d400.20
proj d400.20 400 10 v.base v.dist

Recall that the "femb" program generates two files, "b.ind" and "v.base". The file "b.ind" contains the indexes of the reference objects, which looks like the following:

9 25 36 28 10 37 14 31 39 32

This means that object 9, object 25, ... object 32 are used as reference objects to build the target space.

The file "v.base" contains the image vectors of the objects, which looks like the following:


400 10
  0.0000   0.0000   0.0000   0.0000   0.0000   0.0000   0.0000   0.0000  
0.0000   0.0000
 -2.5781  -2.1680   0.5737   2.0081  -0.7672  -0.0987   0.2544  -0.3053  
0.3483   0.0392
 -2.3080   0.8428   0.4520   0.8193  -0.8753   0.0016   0.4910  -0.0696 
-0.2953  -0.2928
 -1.6848  -0.4318   1.1193   2.0989   0.6211   0.3638   0.1002   0.5503  
0.7714   0.5975
 -3.1964  -0.4338   0.3476  -1.4289  -1.5407   0.6868   1.0198   0.1987  
0.1276  -0.3175
 -1.8273   0.0457  -1.1576   0.3282  -0.1685  -0.7129  -1.0777   0.7811  
0.1593   0.9413
 -2.3284  -0.5072  -0.0256  -0.8463   0.4821   0.2637  -1.6375   1.1964  
0.4367  -1.2949
 -3.2492  -0.9738  -1.5265  -0.6101   0.5196   0.1557  -0.3079  -0.2439  
0.1784  -0.6968
 -1.9128   1.9896  -1.4410   1.2827  -0.9333   0.2214  -0.0460   2.0425  
0.8316   0.2288
 -2.6038  -0.7197   1.5676   1.9001   1.4665   0.6101   1.0496  -0.5040  
1.4032  -0.1418
 -3.4464  -0.2469   0.9019  -0.8877   0.1510   0.9234   0.4323   0.5043 
-0.2450   0.1873
 -1.6865   1.5494  -0.4617  -0.9208   0.8724  -0.1325   0.8989  -0.7223  
0.2492  -0.1926
 -1.5631  -0.2084   0.6216  -0.7920   0.3213  -0.2442   0.3770   0.2211  
0.5378   0.0152
 -2.3364   1.1241   2.0811   0.4693   0.6360  -1.7910  -1.2878  -0.5327 
-0.5601  -0.2646
......
......

The number 400 indicates that there are 400 vectors.
The number 10 indicates that these are 10-dimensional vectors.
Each line of this file represents the coordinates of an image vector in the pseudo-Euclidean space.

Now suppose we are given a target object in a file called "target".
For example, the data in the target file looks like the following:

-2.9375 -1.8045 -2.4635 -2.8335 -1.5975 -1.9245 -1.6905 -2.5165 -2.7335 -1.4645
 0.9480 -0.0230  0.5400 -0.3580 -0.0180  0.0440  0.7980 -0.7720  0.0490 -0.8840

To find the best match of the target, type the command:

msearch -b target

The output on the screen looks like the following:

The nearest neighbor of the target is object with the index 111.

Basically, the msearch tool first computes the distances from the target to the reference objects stored in the file b.ind. Based on these distances, the tool is able to embed the target in the pseudo-Euclidean space. The tool then finds best matches in the pseudo-Euclidean space.

For epsilon range search, suppose we want to find objects that are within distance 2 of the target object stored in the file "target". One may type in the following command:

msearch -r target 2

The output looks like the following:

The data objects that are within distance 2 of the target
have indexes 111, 124, 158, 180.

To speed up search in the target space, we have implemented the VA-file technique introduced by R. Webber et. al. in a paper entitled "A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces" which appeared in the Proc. of VLDB'98.

angle1.c,	ind1.c,	lud1.c,	ortho1.c,	qli1.c,
tri1.c,	uti1.c,	eigw.c,	fbilinw.c,	fchow.c,
fheadw.c,	prow.c,	repeatw.c,	project.c