Document representation (indexing) techniques are dominated by variants
of the term-frequency analysis approach, based on the assumption that the
more occurrences a term has throughout a document, the more important the
term is in that document. Inherent drawbacks associated with this approach
include poor index quality, large document representation size, and the
word mismatch problem. To tackle these drawbacks, a document representation
improvement method called the Relevance Feedback Accumulation (RFA) algorithm
is presented. The algorithm provides a mechanism to continuously accumulate
relevance assessments over time and across users. It also provides a document
representation modification function, or document representation learning
function, that gradually improves the quality of the document representations.
To improve document representations, the learning function uses a data
mining measure called “support” for analyzing the accumulated
relevance feedback.
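
As a rough illustration of how a support measure could drive such a learning function, the sketch below accumulates relevance assessments and keeps, for each document, only the query terms whose support meets a threshold. It is a minimal sketch under stated assumptions, not the RFA algorithm itself: the feedback record layout, the function names `accumulate` and `learn_representation`, the definition of support as the fraction of a document's relevance judgments whose query contained the term, and the `min_support` threshold are all hypothetical.

```python
from collections import Counter, defaultdict


def accumulate(feedback):
    """Accumulate relevance feedback across queries and users.

    feedback: iterable of (query_terms, relevant_doc_ids) pairs
    (an assumed record layout, not taken from the source).
    """
    term_counts = defaultdict(Counter)  # doc_id -> term -> number of judgments
    judgment_counts = Counter()         # doc_id -> total relevance judgments
    for query_terms, relevant_docs in feedback:
        unique_terms = set(query_terms)  # count each term once per query
        for doc_id in relevant_docs:
            judgment_counts[doc_id] += 1
            term_counts[doc_id].update(unique_terms)
    return term_counts, judgment_counts


def learn_representation(term_counts, judgment_counts, doc_id, min_support=0.5):
    """Return {term: support} for terms meeting the assumed threshold."""
    total = judgment_counts[doc_id]
    if total == 0:
        return {}
    return {
        term: count / total
        for term, count in term_counts[doc_id].items()
        if count / total >= min_support
    }


feedback = [
    (["data", "mining"], ["d1"]),
    (["data", "retrieval"], ["d1", "d2"]),
    (["mining", "support"], ["d1"]),
]
term_counts, judgment_counts = accumulate(feedback)
representation = learn_representation(term_counts, judgment_counts, "d1")
```

In this toy run, only terms that appear in at least half of the queries for which `d1` was judged relevant remain in its representation, which is one way a representation could shrink as feedback accumulates.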
Evaluation is done by comparing the RFA algorithm to four other algorithms.
The four measures used for evaluation are (a) average number of index terms
per document; (b) the quality of the document representations assessed by
human judges; (c) retrieval effectiveness; and (d) the quality of the document
representation learning function. The evaluation results show that (1) the
algorithm is able to substantially reduce the document representation size
while maintaining retrieval effectiveness; (2) the algorithm provides
a smooth and steady document representation learning function; and (3) the
algorithm improves the quality of the document representations. The RFA algorithm’s
approach is consistent with efficiency considerations that hold in real information
retrieval systems.
The major contribution made by this research is the design and implementation
of a novel, simple, efficient, and scalable technique for document representation
improvement.