CIS 392 Text Retrieval and Mining

Fall 02

Syllabus

Instructor: Yi-Fang Wu, Ph.D. (a.k.a. Brook Wu)

Time: 10am-11:25am, Monday and Wednesday

Contact the Instructor:  

Tentative Catalog Description:

This course covers various text processing techniques.  Topics include: lexical processing, automatic indexing, retrieval models, search engines, retrieval effectiveness, information extraction, document classification, summarization, document warehousing, and tools for text mining.  Student projects involve working with multivariate analysis tools for text mining.


Prerequisites: CIS 114 Introduction to Computer Science II and Math 333 Probability and Statistics.

Why is this course special?

A quote from Professor Wu's paper:

"Anecdotal evidence suggests that around 90 per cent of databases for large American Corporations consist of text (the remaining 10 per cent consisting of structured databases such as the traditional accounting records).  The preponderance of text in the corporate repositories of information, its pervasive role in electronic commerce and yet the scant attention accorded to its study in accounting is truly astonishing."  (Adapted from: Jagdish Gangolly and Yi-fang Wu, "On the Automatic Classification of Accounting Concepts: Preliminary Results of the Statistical Analysis of Term-Document Frequencies," New Review of Applied Expert Systems and Emerging Technologies, vol. 6, pp. 81-88, 2000.)

What will you learn from this course?

Since text makes up the majority of business data, understanding text processing, warehousing, retrieval, and mining techniques will improve your competitive advantage in the job market.  Some of Professor Wu's graduate students have reported that their jobs involve processing online customer complaints, creating FAQs for their customers, or processing business documents.  Traditional structured database knowledge does not suffice in this area; text retrieval and mining knowledge is required.  Also, if you have ever wondered how search engines find documents for you, or why search engines sometimes seem smart but most of the time do not, this is the right course for you as well.  We will talk about all of these topics in class.

Courses of this type are mostly offered at the graduate level, but NJIT is offering this one at both the graduate and undergraduate levels.  It is a great opportunity to learn this material while you are still in college.  So, please take advantage of this course!  See you in class this Fall!

Probable textbook and references


Tentative Schedule

Topic (Reading)

Welcome and Course Logistics
What Do People Want from IR? (Paper by W. B. Croft)

Text Processing and Retrieval:

Expanding the Scope of Business Intelligence (Sullivan Ch1)
Understanding the Structure of Text (Sullivan Ch2)

Exploiting the Structure of Text (Sullivan Ch3)
Finding and Retrieving Relevant Text (Sullivan Ch7)

Indexing (Sullivan Ch8)
Indexing (cont) (Sullivan Ch8)

Retrieval Performance Improvement (Sullivan Ch7)
Document Classification, Clustering, and Summarization (Sullivan Ch8)

Self-Organizing Maps and Document Classification (SOM_PAK manual)
The Anatomy of a Large-Scale Hypertextual Web Search Engine (Paper by Brin and Page)

Document Warehousing:

Overview of Document Warehousing (Sullivan Ch4)
Meeting Business Intelligence Requirements (Sullivan Ch5)

Mid-term Exam
Exam Discussions

Designing Document Warehousing Architecture (Sullivan Ch6)
Designing Document Warehousing Architecture (cont) (Sullivan Ch6)

Managing Document Warehouse Metadata (Sullivan Ch9)
Ensuring Document Warehouse Integrity (Sullivan Ch10)

Choosing Tools for Building Document Warehousing (Sullivan Ch11)
Developing a Document Warehouse (Sullivan Ch12)

Text Mining:

Overview of Text Mining (Sullivan Ch13)
Text Mining Tools (Sullivan Ch17)

WEKA and Text Mining (WEKA handouts)
Text Mining for Operational Management (Sullivan Ch14)

Text Mining for Customer Relationship Management (Sullivan Ch15)
Text Mining for Competitive Intelligence (Sullivan Ch16)

Review and Exam