TMatch: computing the editing distance of two ordered trees
Jason Wang
Department of Computer Science
New Jersey Institute of Technology
wangj@njit.edu
Dennis Shasha
Courant Institute of Mathematical Sciences
Department of Computer Science
Kaizhong Zhang
Department of Computer Science
University of Western
TMatch is a tool for comparing two rooted ordered labeled trees, which are represented in a preorder parenthesized form. The tool calculates the editing distance of the two trees and displays the edit operations (insertions, deletions and substitutions) needed to transform one tree to the other.
TMatch is written in C and can run in any Unix environment where C complier is available. In the links below, we have posted the source code of the software we developed.
Source files: tdistance.c
tdistance.c computes the editing distance between two trees.
Executable files: tdistance
Data files: FILE1
FILE1 contains two trees (in parenthesized preorder form).
To compile: gcc -o tdistance tdistance.c
To run: tdistance
You may download faster versions tdistance-fast.c, tdistance-faster.c, and a version tdistance-stringlabel.c that is capable of comparing trees whose nodes have string labels.
The tool TMatch compares two trees and finds the minimum number of edit operations that can transform one tree to the other. The edit operations used here are an extension of string edit operations, which include inserting a node, deleting a node, and changing the label of a node.
FILE1 contains two trees:
Tree1: (N(M(M(I(H))(H)(I))(B)))
Tree2: (H(M(B)(I)(M(I)(H)(I))))
Here is an indented form of Tree1:
N
M
M
I
H
H
I
B
and Tree2:
H
M
B
I
M
I
H
I
When running the TMatch program you will see the following lines at command prompt. At the first prompt you will enter the file name or use default parameter. At the second prompt you can use default parameter or write your own file name to see result. To use default parameter values, you may just press "enter" on the keyboard.
% Enter the file name of trees
(an example
file can be found in file FILE1;
maximum size of
trees is 200) [FILE1]:
FILE1
% Where the result should be stored (enter the file name) [data.out] ?
Here is a sample output after using the input FILE1:
Tree1: (N(M(M(I(H))(H)(I))(B)))
Tree2: (H(M(B)(I)(M(I)(H)(I))))
Distance: 5
( 1 B )
( 2 I )
( 1 H )
( 2 I 3 I )
( 3 H 4 H )
( 4 I 5 I )
( 5 M 6 M )
( 6 B )
( 7 M 7 M )
( 8 N 8 H )
Nodes are labeled by numbers based on a post-traversal order. Node 1 (H) and node 6 (B) in Tree1 are deleted. Node 1 (B) and node 2 (I) in Tree2 are inserted. Node 8 (N) in Tree1 is changed to node 8 (H) in Tree2. The other nodes are exact matches. Since there are 5 edit operations used to transform Tree1 to Tree2, the editing distance between Tree1 and Tree2 is 5.
Click here to see the graphical display of the TMatch output.
Jason T. L. Wang, Kaizhong Zhang, Karpjoo Jeong and Dennis Shasha, "A System for Approximate Tree Matching," IEEE Transactions on Knowledge and Data Engineering, Vol. 6, No. 4, August 1994, pp. 559-571.
Some browsers open the PDF file
and the Web page manuals and programs instead
of starting a download. If this happens, try right-clicking on
the link and choosing an option named "Save Target As..." or similar. If a
separate window is popped up, click "File" on the top bar menu of
the window and
click on "Save As" to save the file.