Dr. Xin Chen
Dissertation Title:
 

Text Mining with the Exploitation of User's Background Knowledge: Discovering Novel Association Rules from Text

 
 
Dissertation committee:
 

• Advisor: Dr. Brook Wu

• Members: Dr. Murray Turoff, Dr. Vincent Oria, Dr. Il Im, Dr. Marcia L. Zeng (Kent State University)

 
Abstract:
 

The goal of text mining is to find interesting and non-trivial patterns or knowledge in unstructured documents. Both objective and subjective measures have been proposed to evaluate the interestingness of discovered patterns. However, objective measures alone are insufficient because they do not consider users' knowledge and interests. Subjective measures require explicit input of user expectations, which is difficult or even impossible to obtain in text mining environments.

This study proposes a user-oriented text mining framework and applies it to the problem of discovering novel association rules from documents. The system, uMining, consists of two major components: a background knowledge developer and a novel association rule miner. Background knowledge is developed from documents already known to the user (background documents) and modeled as a keyword space with a concept hierarchy developed inside it. Target documents are retrieved from a large corpus by selecting documents that are relevant to the user's background. The association rule miner discovers association rules among noun phrases extracted from the target documents.
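The rule-mining step can be illustrated with a minimal Python sketch. It is not the uMining implementation: it assumes noun phrases have already been extracted for each target document, restricts itself to one-to-one rules, and filters them with the standard support and confidence measures. The function name `mine_pair_rules`, its thresholds, and the sample phrases are hypothetical.

```python
from itertools import permutations


def mine_pair_rules(documents, min_support=0.5, min_confidence=0.6):
    """Mine rules "a -> b" between noun phrases that co-occur in documents.

    documents: list of iterables of noun phrases (one iterable per document).
    Returns a sorted list of (antecedent, consequent, support, confidence).
    """
    n = len(documents)
    item_count = {}  # number of documents containing each phrase
    pair_count = {}  # number of documents containing both phrases of a pair
    for doc in documents:
        phrases = set(doc)  # count each phrase once per document
        for p in phrases:
            item_count[p] = item_count.get(p, 0) + 1
        for a, b in permutations(sorted(phrases), 2):
            pair_count[(a, b)] = pair_count.get((a, b), 0) + 1

    rules = []
    for (a, b), count in pair_count.items():
        support = count / n                 # fraction of documents with a and b
        confidence = count / item_count[a]  # fraction of a-documents that also have b
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, b, support, confidence))
    return sorted(rules)


# Hypothetical target documents, already reduced to noun phrases.
docs = [
    ["gene therapy", "clinical trial"],
    ["gene therapy", "clinical trial"],
    ["gene therapy"],
    ["stem cell"],
]
rules = mine_pair_rules(docs)
```

Support and confidence are purely objective: they say how often a rule holds, not whether it is new to a particular user, which is the gap the user-oriented novelty measure is meant to fill.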

The user-oriented novelty measure is developed to evaluate the interestingness (novelty and usefulness) of association rules. It is defined as the semantic distance between the antecedent and the consequent of a rule in the background knowledge keyword space. The novelty measure is decomposed into two components: occurrence distance and connection distance. The former looks at the overlapping area of two keywords: the more they overlap, the smaller the distance. The latter calculates the distance between two keywords in the concept hierarchy, which is the length of the shortest path connecting them in the hierarchy. The longer the path, the larger the distance.
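The two components can be sketched in Python under stated assumptions. The abstract does not give the exact formulas, so this sketch assumes that the "overlapping area" of two keywords is the overlap between the sets of background documents in which they occur (measured here as one minus the Jaccard coefficient), and that the concept hierarchy is a parent-to-children mapping searched as an undirected graph. The function names and the sample hierarchy are hypothetical, and how the two components are combined into a single score is deliberately left out.

```python
from collections import deque


def occurrence_distance(docs_a, docs_b):
    """Assumed form: 1 - Jaccard overlap of the documents containing each keyword."""
    a, b = set(docs_a), set(docs_b)
    union = a | b
    if not union:
        return 1.0  # no occurrences at all: treat as maximally distant
    return 1.0 - len(a & b) / len(union)


def connection_distance(hierarchy, source, target):
    """Length (in edges) of the shortest path between two keywords.

    hierarchy: dict mapping a parent keyword to a list of child keywords.
    Returns None if the two keywords are not connected.
    """
    # Treat parent-child links as undirected so paths can go up and down.
    adjacency = {}
    for parent, children in hierarchy.items():
        for child in children:
            adjacency.setdefault(parent, set()).add(child)
            adjacency.setdefault(child, set()).add(parent)

    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:  # breadth-first search finds the shortest path first
        node, dist = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor == target:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


# Hypothetical background knowledge.
hierarchy = {"science": ["biology", "physics"], "biology": ["genetics"]}
path_length = connection_distance(hierarchy, "genetics", "physics")
overlap_distance = occurrence_distance({1, 2, 3}, {2, 3, 4})
```

In this toy hierarchy the path genetics, biology, science, physics has three edges, and the two keywords share two of four documents, so the occurrence distance is 0.5.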

The evaluation focused on the novelty prediction accuracy and the usefulness indication power of the user-oriented novelty measure. The results show that the user-oriented novelty measure has high novelty prediction accuracy and outperforms the WordNet novelty measure and the Support and Confidence measures in novelty prediction. The user-oriented novelty measure also has high usefulness indication power and outperforms the WordNet novelty measure and seven other objective interestingness measures in usefulness indication.