TMatch: computing the editing distance of two ordered trees

 

Jason Wang
Department of Computer Science
New Jersey Institute of Technology
wangj@njit.edu

Dennis Shasha
Courant Institute of Mathematical Sciences
Department of Computer Science
New York University

Kaizhong Zhang
Department of Computer Science
University of Western Ontario


Introduction

 

TMatch is a tool for comparing two rooted ordered labeled trees, which are represented in a preorder parenthesized form. The tool calculates the editing distance of the two trees and displays the edit operations (insertions, deletions and substitutions) needed to transform one tree to the other.

 


Installation

 

TMatch is written in C and can run in any Unix environment where C complier is available. In the links below, we have posted the source code of the software we developed.

 

Source files:      tdistance.c

                        tdistance.c computes the editing distance between two trees.

Executable files: tdistance

 

Data files:         FILE1

FILE1 contains two trees (in parenthesized preorder form).

           

To compile: gcc -o tdistance tdistance.c

 

To run: tdistance

 

You may download faster versions tdistance-fast.c, tdistance-faster.c, and a version tdistance-stringlabel.c that is capable of comparing trees whose nodes have string labels.

 


Input

 

The tool TMatch compares two trees and finds the minimum number of edit operations that can transform one tree to the other. The edit operations used here are an extension of string edit operations, which include inserting a node, deleting a node, and changing the label of a node.

 

FILE1 contains two trees:

 

Tree1: (N(M(M(I(H))(H)(I))(B)))

Tree2: (H(M(B)(I)(M(I)(H)(I))))

 

Here is an indented form of Tree1:

N

  M

      M

          I

            H

          H

          I

      B

 

and Tree2:

H

   M

      B

      I

      M

          I

          H

          I

 


Output

 

When running the TMatch program you will see the following lines at command prompt.  At the first prompt you will enter the file name or use default parameter.   At the second prompt you can use default parameter or write your own file name to see result. To use default parameter values, you may just press "enter" on the keyboard.  

 

% Enter the file name of trees

  (an example file can be found in file FILE1;

  maximum size of trees is 200) [FILE1]: FILE1

 

% Where the result should be stored (enter the file name) [data.out] ?

 

Here is a  sample output after using the input FILE1:

 

Tree1: (N(M(M(I(H))(H)(I))(B)))

Tree2: (H(M(B)(I)(M(I)(H)(I))))

Distance:    5

(             1 B  )

(              2 I  )

(   1 H            )

(   2 I       3 I  )

(   3 H     4 H )

(   4 I        5 I )

(   5 M   6 M  )

(   6 B            )

(   7 M   7 M )

(   8 N    8 H )

 

Nodes are labeled by numbers based on a post-traversal order. Node 1 (H) and node 6 (B) in Tree1 are deleted. Node 1 (B) and node 2 (I) in Tree2 are inserted. Node 8 (N) in Tree1 is changed to node 8 (H) in Tree2. The other nodes are exact matches. Since there are 5 edit operations used to transform Tree1 to Tree2, the editing distance between Tree1 and Tree2 is 5.

 

Click here to see the graphical display of the TMatch output.

 


Citation

Jason T. L. Wang, Kaizhong Zhang, Karpjoo Jeong and Dennis Shasha, "A System for Approximate Tree Matching," IEEE Transactions on Knowledge and Data Engineering, Vol. 6, No. 4, August 1994, pp. 559-571.


Download Issues

Some browsers open the PDF file and the Web page manuals and programs instead of starting a download. If this happens, try right-clicking on the link and choosing an option named "Save Target As..." or similar. If a separate window is popped up, click "File" on the top bar menu of the window and click on "Save As" to save the file.