CIS 634 INFORMATION RETRIEVAL

Syllabus for Distance Learning Section

Fall 2004

(subject to change)

Instructor for the DL section and Course Coordinator: Yi-Fang Wu, Ph.D. (a.k.a. Brook Wu)

Instructor for the FTF section: Mr. Quanzhi Li

Face-to-Face Class Location & Time: KUPF 104, 6:00pm-9:05pm, Wednesday

Contact the Instructors:  

    for Dr. Wu:

   for Quanzhi Li:

Class Web Board: http://webboard.njit.edu:8080/~F2004CIS634-101,851/

Outdated Catalog Description (from NJIT web site)

Information Retrieval 3 credits

Prerequisites: CIS 631. Covers the concepts and principles of information retrieval systems design. Techniques essential for building text databases, document processing systems, office automation systems, and other advanced information management systems.

Introduction (by the instructor)

Information retrieval (IR) is a fast-changing field concerned with the representation, organization, storage, and retrieval of information items.  Under a broad definition, information items include text, numbers, multimedia, and more; under a narrow definition, they include only text.  Because other courses cover other types of retrieval, this course focuses mostly on text retrieval.  Even so, there are still many topics to cover.

The importance of text retrieval is obvious.  Most business data is in text format.  However, most text is not as well organized as the numerical data stored in commercial databases.  This, along with linguistic complexity, has kept the performance of text retrieval low.  To achieve high retrieval effectiveness, techniques such as automatic indexing, query expansion, local context analysis, information extraction, text mining, and many more have been developed.  As an information professional, you should know how to use these techniques to organize, store, and retrieve text effectively and efficiently.

This course is designed to address both the theory and the practice of IR.  It consists of two parts: an introduction to IR theories, and hands-on retrieval experience using a readily available system.  Topics newly added in Fall 2003: document warehousing and text mining.

Please note: office automation systems and advanced information management systems listed in the course catalog for CIS 634 are not covered in this course. 

Class Conduct and Attendance

This is a graduate course.  As an NJIT graduate student, you must follow the Institute's academic rules.  Please refer to the student handbook for details.  Specifically:

§         Academic dishonesty is not allowed and will be reported to the Dean of Student Services.

§         All written assignments will be sent to http://www.turnitin.com/, a plagiarism prevention system, for verification.

§         Late assignments will be penalized 25% a day.

§         You are required to check webboard announcements at least 3 times a week.   

§         When posting messages on the class web board, respect others and be considerate.

§         You are expected to attend all class meetings ON TIME.  Class attendance will be recorded every meeting.  Poor attendance will negatively affect your semester grade. 

§         If you have to drop the class after you are assigned to a project group, please notify the instructor and your group members immediately.

§         Snow closing information will be available on the NJIT web site.  The instructor does not make the decision.

 

Textbooks

Required 1: Information Storage and Retrieval, by Robert R. Korfhage, Publisher: John Wiley & Sons (ISBN: 0471143383)

Required 2: Trailblazing a Path Towards Knowledge and Transformation, by HsinChun Chen.  It is available in its entirety at the author's web site: http://ai.bpa.arizona.edu/go/download/Chen2Book.pdf (a small booklet containing only 80 5"x7" pages).  Please finish reading it by the end of the 4th week.

Required 3: Knowledge Management Systems: A Text Mining Perspective, by HsinChun Chen.  It is available in its entirety at the author's web site: http://ai.bpa.arizona.edu/go/download/chenKMSi.pdf (a small booklet containing only 50 5"x7" pages).  Please finish reading it by the end of the 9th week.

Optional 1: Document Warehousing and Text Mining: Techniques for Improving Business Operations, Marketing, and Sales, by Dan Sullivan, Publisher: Wiley, 2001 (ISBN: 0471399590)

Optional 2: Information Retrieval (2nd Ed), by C. J. van Rijsbergen, Publisher: London: Butterworths, 1979.  It is available in its entirety at the author's web site: http://www.dcs.gla.ac.uk/Keith/Preface.html

Supplemental readings are available through the links on the course schedule below.

 

Assignments, Grading, and Due Dates (subject to change):

Note:  All written assignments should be word-processed and posted on the web board as WORD or PDF attachments.  Do not submit hard copies.  The group presentation slides should be posted as PowerPoint attachments.  Remember to put your name on every submission, but do not list your social security number!  Most importantly, all written assignments should follow proper citation style!

1. Participation (attendance, self-intro, group competition, in-class and webboard contributions)  10%

2. An Online Search Log                                         5%

3. Two Retrieval Experiments                                  20%

4. Semester Project                                                 35%

5. Final Exam                                                          30%

          Total                                                                   100%

Following is an overview of the assignments and projects; more details will be provided in class.

1. Participation (10%): 

Design Part:

Instructions: You may want to use Visio or MS Word to draw the flow charts. Remember to attach your files.
Keep one very important thing in mind: the web is very large, and generating an index for web documents is complicated. A search engine works as follows:

1. Document collection: use spiders or crawlers to collect documents and gather basic information such as URL, title, and last-updated date.
2. Indexing: parse the web documents to obtain a list of unique terms and, from it, the final index. This takes several steps: stop word removal, stemming, and removal of high- and low-frequency words (Zipf's law).

Both steps must be completed for the whole document collection before users can use the search system.

3. Search interface: accepts queries from users.

4. Retrieval: the system compares the query with the index entries for each document in the collection.

4-a. Simple Boolean queries: the retrieval component checks for the presence and/or absence of query terms in the document index.
4-b. Similarity using a distance-based or angle-based method: a calculation between the query vector and every document vector is required.
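The steps above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not a required design: the three-document collection, the tiny stop word list, and the toy suffix-stripping `stem` function are all hypothetical stand-ins, crawling is replaced by an in-memory dictionary, and the Zipf-based removal of high- and low-frequency words is omitted for brevity. Boolean retrieval is shown as a simple AND over an inverted index, and the angle-based similarity is shown as cosine similarity over raw term counts.

```python
import math
from collections import Counter, defaultdict

# Hypothetical collection standing in for documents gathered by a crawler.
DOCS = {
    "d1": "Information retrieval systems index documents",
    "d2": "Web crawlers collect documents from the web",
    "d3": "Indexing and retrieval of text documents",
}
STOP_WORDS = {"and", "of", "the", "from", "a", "an", "in"}  # tiny illustrative list


def stem(word):
    """Toy suffix stripper for illustration only (not the Porter stemmer)."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def terms(text):
    """Lowercase, split on whitespace, drop stop words, then stem."""
    return [stem(w) for w in text.lower().split() if w not in STOP_WORDS]


def build_index(docs):
    """Step 2: build an inverted index (term -> doc ids) and term-count vectors."""
    index = defaultdict(set)
    vectors = {}
    for doc_id, text in docs.items():
        counts = Counter(terms(text))
        vectors[doc_id] = counts
        for term in counts:
            index[term].add(doc_id)
    return index, vectors


def boolean_and(index, query):
    """Step 4-a: documents whose index contains every query term."""
    postings = [index.get(t, set()) for t in terms(query)]
    return set.intersection(*postings) if postings else set()


def cosine(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank(vectors, query):
    """Step 4-b: compare the query vector with every document vector."""
    q = Counter(terms(query))
    scores = {doc_id: cosine(q, vec) for doc_id, vec in vectors.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Indexing happens once, before any query is accepted (step 3).
INDEX, VECTORS = build_index(DOCS)
```

Calling `boolean_and(INDEX, "indexing retrieval")` returns the documents containing both stemmed terms, while `rank(VECTORS, "web documents")` orders all documents by similarity to the query. Note that the ranking step touches every document vector, which is exactly the cost the assignment asks you to think about at web scale.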

Voting Part:

For voting, please select the 3 best designs from those submitted by DL students.  The most important criterion is whether the design works.  Other factors, such as efficiency, are used to rank the best designs.

 

 

2. An Online Search Log (5%):  This assignment appears to be easy and straightforward.  However, if you carefully record the whole experience, you will encounter many IR problems during the exercise.  The instructor will show you how this exercise relates to other IR topics later in the semester.

§        Part I: Search and record any useful information for the task assigned.  There are no limitations on the number of resources you can use, as long as they are web search engines, web directories, or electronic journal databases available at the NJIT library web site.  (The instructor might not have access to other resources.)

Task: Please find the necessary information on "how to set up a small network for three desktops running Windows XP at home?"  (Note: 1. The sentence describing the task should not be the query you use to search for information.  You should come up with your own queries.  Through this experience, you will learn why it is difficult for users to find information.  2. Even if you already know the answer, please pretend you do not and go on with the assignment.)

For each search session, be sure to record the following items along the search process:

1.      The search engine/directory/electronic database you used 

2.      The search query you entered.  (If you entered several queries with the same search tool, treat them as different sessions.)

3.      Number of search hits returned.

4.      In the top twenty returned documents, find out the number of documents/hits actually relevant to your query and their URLs (or document titles, if an electronic database is used). 

5.      Among all returned hits, what is the number of returned documents you browsed through?  

Note:

§         If you use more than one query for a particular search engine, please record items 2, 3, 4, and 5 for each query.

§         If you do not find useful information and decide to use other resources, please repeat the above steps.

§         Clearly mark the URL or title of the best site/best paper obtained from your search.   

§         Part II: Read "What Do People Want from IR" by Croft.  Based on your experience as a user of web search engines and/or text databases, write a short paper called "What Do I Need from IR?"  List at least 3 items.  You do not need to pick items from Croft's paper.  For each item, please provide examples/frustrations you experienced in the first part of this assignment.  Limit your response to 500 words.
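Item 4 of the search log in Part I (the number of relevant documents among the top twenty hits) can be turned into a single figure per session. The following Python sketch is one possible way to do that, not a required part of the assignment: the function name `precision_at_k`, the session record, and its field names are hypothetical, and dividing by the number of hits actually inspected when fewer than twenty were returned is an assumption made here.

```python
def precision_at_k(judgments, k=20):
    """Fraction of the first k results judged relevant.

    judgments: list of booleans in rank order (True = relevant).
    Assumption: if fewer than k results exist, divide by the number available.
    """
    top = judgments[:k]
    return sum(top) / len(top) if top else 0.0


# Hypothetical session record: 20 hits inspected, the first 5 judged relevant.
session = {
    "engine": "example search engine",
    "query": "home network windows xp",
    "judgments": [True] * 5 + [False] * 15,
}
P20 = precision_at_k(session["judgments"])
```

Recording the relevance judgment for each of the top twenty hits, rather than only the total, makes it easy to compare sessions across queries and search tools.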

3. Retrieval Experiments (20%):  Note that each student will work individually and will use a unique set of queries; therefore, no two students' results will be identical.  Results of both retrieval experiments should be posted on your own web site.  Note: To gain access to the test collections and the IR system, all students should get an AFS account.  Please follow the instructions at http://newaccount.njit.edu/.  The instructor will NOT help you obtain the account.  Please direct all your questions to the Computing Help Desk at 973-596-2900.

 

Assigned queries for your RE I and II 

 

(Will be filled out soon after the withdrawal deadline.)

Q1 Jimin R. Bhuptani
Q2 Keerti K. Chivakula
Q3 Paul L. Cihak
Q4 Arun Dabas
Q5 Dr. Wu
Q6 Mircea Dascaloiu
Q7 Thomas J. German
Q8 Mythili Jammalamadaka
Q9 Latha Kalidindi
Q10 Heather L. Kile
Q11 Mineshkumar D. Lad
Q12 Derek K. Linebarger
Q13 Tejal B. Mistry
Q14 Syed Salman Mohsin
Q15 Prerak J. Parikh
Q16 Mayur R. Patel
Q17 Paavan A. Pujara
Q18 William Michael Rosellini
Q19 Heny Shah
Q20 Jeffrey L. Spector
Q24 Sandhya Srinivasan
Q25 Richard Wang
Q26  
Q27  
Q28  
Q29  
Q30  
Q31  
Q32  
Q33  
Q34  
Q35  

 

§         RE I (10%): You are required to work with an IR system, a test collection, and a query, and then run several retrieval experiments using Arrow from the Bow toolkit (see the Resources section of the syllabus).  The results of your Retrieval Experiment 1 should look like this page: http://www-ec.njit.edu/~wu/cis634/cis634re1.html

§         RE II (10%): You are required to create a small document collection based on the results from your RE I, build a document-term matrix, and perform document classification/clustering analysis using Rainbow and Nenet (see Resources below).  The results of your Retrieval Experiment 2 should look like this page: http://web.njit.edu/~wu/cis634/cis634re2.html
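For readers unfamiliar with the term, the following Python sketch shows what a document-term matrix is: one row per document, one column per vocabulary term, and each cell holding a term count. It is a generic illustration only. The two sample documents are hypothetical, no stop word removal or stemming is applied, and the sketch does not reproduce the input or output formats of Rainbow or Nenet.

```python
from collections import Counter


def document_term_matrix(docs):
    """Return (vocabulary, matrix), where matrix[i][j] counts term j in document i."""
    tokenized = [text.lower().split() for text in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[term] for term in vocab])
    return vocab, matrix


# Hypothetical two-document collection.
DOCS = [
    "text mining finds patterns in text",
    "neural networks cluster text documents",
]
VOCAB, MATRIX = document_term_matrix(DOCS)
```

Each row of `MATRIX` is the vector for one document, so rows can be compared with one another for clustering, and columns show how a term is distributed across the collection.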

4. Semester Group Project (35%): Two options:  case analysis project or programming project.  Instructions here.

Option 1: Case analysis

You are required to make up a fictional client and its business problems.

§         Part I (5%): A proposal defining your client, its business problems, and at least 3 possible sources of documents.

§         Part II (25%): Use text mining software programs to collect, pre-process, index, mine, and analyze the documents you collected.  Deliverables: 1. A CD containing all documents you collected; 2. PowerPoint slides; 3. A final report containing a. a 1-page executive summary, b. the main report: tools used and screen shots of outputs, c. analysis and recommended solutions to the client's business problems.

§         Part III (5%): Presentation 

Option 2: Programming Project 

You can design your own project with the instructor's approval.  Sample choices are: text retrieval systems, information extraction systems, automatic summarization, etc.

§         Part I (5%): A proposal describing your project, including system functions and the tools that you will use to develop the system.  Please use flow charts to demonstrate your idea.

§         Part II (25%): A. System development using whichever programming language your group is most familiar with.  (You will have all the necessary IR concepts and theories from lectures.  The instructor will not spend time discussing the implementation.  For example, the instructor will discuss what an inverted file is and how it can be used for automatic indexing and retrieval, but not how to use Java or C++ to generate the inverted file.)  B. System design and documentation (including flow charts and user manual), and the evaluation of system performance.

§         Part III (5%): Presentation

5. Final Exam (30%):  It covers all previous lectures and the two booklets written by Dr. HsinChun Chen.

 

Resources  

1. from the Information Retrieval Group at the University of Glasgow

§         A list of stop words: http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words

§         Test Collections: http://www.dcs.gla.ac.uk/idom/ir_resources/test_collections/ 

2. from McCallum, Andrew Kachites, Computer Science Dept, Carnegie Mellon University

§         "Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering," 1996.  http://www.cs.cmu.edu/~mccallum/bow.

§         Arrow help file

§         Crossbow help file

§         Rainbow help file

3. from Neural Networks Research Centre, Helsinki University of Technology

§         SOM_PAK

§         Nenet

4. from AI Lab, University of Arizona

§         Web Spiders

Schedule (subject to change, last updated August/27/2004)

Week 1 (09/01)
Topics: Course Logistics and Overview; IR Academic Resources; Document and Query Forms
Readings: "What do people want from IR" by Croft; Korfhage Ch 1-2; Van Rijsbergen Ch 1-2

Week 2 (09/08)
Topics: Data Compression; Query Structures
Readings: Korfhage Ch 2-3
Due Dates: Self-introduction on webboard due (Sept 11)

Week 3 (09/15)
Topics: Matching Process; Text Analysis
Readings: Korfhage Ch 3-5; Rijsbergen Ch 2 (the Zipf's Law part only)
Due Dates: On-line search log due (Sept 18)

Week 4 (09/22)
Topics: TF.IDF; Basics of UNIX; Experiencing an IR System
Readings: Retrieval Experiment I Instructions; Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering
Due Dates: Finish reading Dr. Chen's book "Trailblazing a Path Towards Knowledge and Transformation"

Week 5 (09/29)
Topics: Document Similarity; Document and Concept Classification/Clustering Techniques; Overview of Final Projects
Readings:
    Rijsbergen Ch 3
    On the Automatic Classification of Accounting concepts: Preliminary Results of the Statistical Analysis of Term-Document Frequencies, by Gangolly and Wu, published in New Review of Applied Expert Systems and Emerging Technologies, pp 81-88, v 6, 2000.
Due Dates: IR system design competition due (Oct 02)

Week 6 (10/06)
Topics: Text Mining using Neural Networks; SOM and Nenet package overview
Readings:
    Retrieval Experiment II Instructions
    AI LAB: A Scalable Self-Organizing Map Algorithm for Textual Classification: A Neural Network Approach to Automatic Thesaurus Generation (Roussinov & Chen, 1998)
Due Dates: Both 1. Retrieval Exp I and 2. Votes for best IR system design due (Oct 09)

Week 7 (10/13)
Topics: Document Warehousing
Readings: Sullivan Ch 1

Week 8 (10/20)
Topics: Information Extraction; Text Mining Applications
Readings:
    1. Sullivan Ch 13
    2. Information Extraction: Techniques and Challenges, by Ralph Grishman
    3. An interactive system for finding complementary literatures: a stimulus to scientific discovery, by Don R. Swanson and Neil R. Smalheiser, Artificial Intelligence, Volume 91, Issue 2, April 1997, Pages 183-203
    4. Combining Data and Text Mining Techniques For Analyzing Financial Reports
Due Dates: Semester Project Proposal due (Oct 16)

Week 9 (10/27)
Topics: Retrieval Effectiveness Measures; Output Presentation
Readings: Korfhage Ch 8, 11; Rijsbergen Ch 7
Due Dates: Retrieval Exp II due (Oct 30); Finish reading Dr. Chen's book "Knowledge Management Systems: A Text Mining Perspective"

Week 10 (11/03)
Topics: IR Effectiveness Improvement Techniques: Relevance Feedback, Query Expansion, Local Context Analysis, and Word-Sense Disambiguation
Readings: Korfhage Ch 9

Week 11 (11/10)
Topics: User Profiles; Alternative Retrieval Techniques
Readings:
    Korfhage Ch 6, 10
    S. Chakrabarti, B. Dom and P. Indyk. Enhanced hypertext categorization using hyperlinks. Proceedings of ACM SIGMOD 1998.  http://www.cs.toronto.edu/~wtjioe/mining/hypertext.pdf
    J. Kleinberg. Authoritative sources in a hyperlinked environment.  http://www.cs.cornell.edu/home/kleinber/auth.pdf

Week 12 (11/17)
Topics: Natural Language Processing
Readings: Little Words Can Make a Big Difference for Text Classification, by Ellen Riloff

Week 13 (11/24)
Thanksgiving. No Class.

Week 14 (12/01)
Topics: Final Exam (covering all previous lectures and the 2 booklets by Dr. HsinChun Chen)

Week 15 (12/08)
Topics: Semester Project Presentation
Due Dates: Final Project due midnight Dec/07