CIS 392 Text Retrieval and Mining
Fall 02
Syllabus
Instructor: Yi-Fang Wu, Ph.D.
Time: 10am-11:25am, Monday and Wednesday
Contact the Instructor:
In person: GITC 5502.
By Phone: (973) 596 - 5285
By e-mail (fastest way to get my attention): wu@njit.edu
Tentative Catalog Description:
This course covers various text processing techniques. Topics include: lexical processing, automatic indexing, retrieval models, search engines, retrieval effectiveness, information extraction, document classification, summarization, document warehousing, and tools for text mining. Student projects involve working with multivariate analysis tools for text mining.
Prerequisites:
CIS 114 Introduction to Computer Science II and Math 333 Probability and Statistics.
Why is this course special?
A quote from Professor Wu's paper:
"Anecdotal evidence suggests that around 90 per cent of databases for large American Corporations consist of text (the remaining 10 per cent consisting of structured databases such as the traditional accounting records). The preponderance of text in the corporate repositories of information, its pervasive role in electronic commerce and yet the scant attention accorded to its study in accounting is truly astonishing." (adapted from: On the Automatic Classification of Accounting concepts: Preliminary Results of the Statistical Analysis of Term-Document Frequencies, by Jagdish Gangolly and Yi-fang Wu, published in New Review of Applied Expert Systems and Emerging Technologies, pp 81-88, v 6, 2000.)
What will you learn from this course?
Because text makes up the majority of business data, understanding text processing, warehousing, retrieval, and mining techniques will give you a competitive advantage in the job market. Some of Professor Wu's graduate students have reported that their jobs involve processing online customer complaints, creating FAQs for their customers, or processing business documents. Knowledge of traditional structured databases does not suffice in this area; knowledge of text retrieval and mining is required. Also, if you have ever wondered how search engines find documents for you, or why search engines are sometimes smart but most of the time are not, this is the right course for you as well. We will talk about all of these interesting topics in class.
Courses of this type are mostly offered at the graduate level, but NJIT offers this one at both the graduate and undergraduate levels. It is a great opportunity to learn something new while you are still in college. So, please take advantage of this course! See you in class this Fall!
Probable textbook and references
Textbook:
Sullivan, Dan. Document Warehousing and Text Mining. Wiley, 2001.
Paper:
Brin, Sergey, and Lawrence Page. “The Anatomy of a Large-Scale Hypertextual Web Search Engine.” Computer Networks and ISDN Systems 30 (1998): 107-117.
Paper:
Croft, W. Bruce. “What Do People Want from Information Retrieval?” D-Lib Magazine, November 1995.
Grading
Three assignments | 30%
Pop-up quizzes | 10%
Mid-term exam | 20%
Final exam | 30%
Participation | 10%
Tentative Schedule
Week # | Topics | Material
1 | Welcome and Course logistics |
  | What Do People Want from IR? |
Text Processing and Retrieval
2 | Expanding the Scope of Business Intelligence | Sullivan Ch1
  | Understanding the Structure of Text | Sullivan Ch2
3 | Exploiting the Structure of Text | Sullivan Ch3
  | Finding and Retrieving Relevant Text | Sullivan Ch7
4 | Indexing | Sullivan Ch8
  | Indexing (cont) | Sullivan Ch8
5 | Retrieval Performance Improvement | Sullivan Ch7
  | Document Classification, Clustering, and Summarization | Sullivan Ch8
6 | Self-Organizing Maps and Document Classification | SOM_PAK manual
  | The Anatomy of a Large-Scale Hypertextual Web Search Engine |
Document Warehousing
7 | Overview of Document Warehousing | Sullivan Ch4
  | Meeting Business Intelligence Requirements | Sullivan Ch5
8 | Mid-term exam |
  | Exam Discussions |
9 | Designing Document Warehousing Architecture | Sullivan Ch6
  | Designing Document Warehousing Architecture (cont) | Sullivan Ch6
10 | Managing Document Warehouse Metadata | Sullivan Ch9
  | Ensuring Document Warehouse Integrity | Sullivan Ch10
11 | Choosing tools for Building Document Warehousing | Sullivan Ch11
  | Developing a Document Warehouse | Sullivan Ch12
Text Mining
12 | Overview of Text Mining | Sullivan Ch13
  | Text mining Tools | Sullivan Ch17
13 | WEKA and Text Mining | WEKA handouts
  | Text Mining for Operational Management | Sullivan Ch14
14 | Text Mining for Customer Relationship Management | Sullivan Ch15
  | Text Mining for Competitive Intelligence | Sullivan Ch16
15 | Review and Exam |