CIS 392 Text Retrieval and Mining

Fall 02

Syllabus

Instructor: Yi-Fang Wu, Ph.D. (a.k.a. Brook Wu)

Time: 10am-11:25am, Monday and Wednesday

Contact the Instructor:  

Tentative Catalog Description:

This course covers various text processing techniques.  Topics include: lexical processing, automatic indexing, retrieval models, search engines, retrieval effectiveness, information extraction, document classification, summarization, document warehousing, and tools for text mining.  Student projects involve working with multivariate analysis tools for text mining.

Prerequisites: 

CIS 114 Introduction to Computer Science II and Math 333 Probability and Statistics.

Why is this course special?

A quote from Professor Wu's paper:

"Anecdotal evidence suggests that around 90 per cent of databases for large American Corporations consist of text (the remaining 10 per cent consisting of structured databases such as the traditional accounting records).  The preponderance of text in the corporate repositories of information, its pervasive role in electronic commerce and yet the scant attention accorded to its study in accounting is truly astonishing."  (Adapted from: On the Automatic Classification of Accounting Concepts: Preliminary Results of the Statistical Analysis of Term-Document Frequencies, by Jagdish Gangolly and Yi-fang Wu, published in New Review of Applied Expert Systems and Emerging Technologies, vol. 6, pp. 81-88, 2000.)

What will you learn from this course?

Since text makes up the majority of business data, understanding text processing, warehousing, retrieval, and mining techniques will give you a competitive advantage in the job market.  Some of Professor Wu's graduate students have reported that their jobs involve processing online customer complaints, creating FAQs for their customers, or processing business documents.  Traditional structured database knowledge does not suffice in this area; text retrieval and mining knowledge is required.  Also, if you have ever wondered how search engines find documents for you, or why search engines sometimes seem smart but most of the time do not, this is the right course for you as well.  We will talk about all of these interesting topics in class.

Courses of this type are mostly offered at the graduate level, but NJIT is offering this one at both the graduate and undergraduate levels.  It is a great opportunity to learn something new while you are still in college.  So, please take advantage of this course!  See you in class this Fall!

Probable textbook and references

Grading 

Tentative Schedule

| Week # | Topics | Material |
|---|---|---|
| 1 | Welcome and Course Logistics | |
| | What Do People Want from IR? | Paper by W. B. Croft |

Text Processing and Retrieval

| Week # | Topics | Material |
|---|---|---|
| 2 | Expanding the Scope of Business Intelligence | Sullivan Ch1 |
| | Understanding the Structure of Text | Sullivan Ch2 |
| 3 | Exploiting the Structure of Text | Sullivan Ch3 |
| | Finding and Retrieving Relevant Text | Sullivan Ch7 |
| 4 | Indexing | Sullivan Ch8 |
| | Indexing (cont.) | Sullivan Ch8 |
| 5 | Retrieval Performance Improvement | Sullivan Ch7 |
| | Document Classification, Clustering, and Summarization | Sullivan Ch8 |
| 6 | Self-Organizing Maps and Document Classification | SOM_PAK manual |
| | The Anatomy of a Large-Scale Hypertextual Web Search Engine | Paper by Brin and Page |

Document Warehousing

| Week # | Topics | Material |
|---|---|---|
| 7 | Overview of Document Warehousing | Sullivan Ch4 |
| | Meeting Business Intelligence Requirements | Sullivan Ch5 |
| 8 | Mid-term exam | |
| | Exam Discussions | |
| 9 | Designing Document Warehousing Architecture | Sullivan Ch6 |
| | Designing Document Warehousing Architecture (cont.) | Sullivan Ch6 |
| 10 | Managing Document Warehouse Metadata | Sullivan Ch9 |
| | Ensuring Document Warehouse Integrity | Sullivan Ch10 |
| 11 | Choosing Tools for Building Document Warehousing | Sullivan Ch11 |
| | Developing a Document Warehouse | Sullivan Ch12 |

Text Mining

| Week # | Topics | Material |
|---|---|---|
| 12 | Overview of Text Mining | Sullivan Ch13 |
| | Text Mining Tools | Sullivan Ch17 |
| 13 | WEKA and Text Mining | WEKA handouts |
| | Text Mining for Operational Management | Sullivan Ch14 |
| 14 | Text Mining for Customer Relationship Management | Sullivan Ch15 |
| | Text Mining for Competitive Intelligence | Sullivan Ch16 |
| 15 | Review and Exam | |