CIS 634 INFORMATION RETRIEVAL

Syllabus for Distance Learning Section

Fall 2003

(subject to change)

Instructor: Yi-Fang Wu, Ph.D. (a.k.a. Brook Wu)

Class Web Board: http://webboard.njit.edu:8080/~F2003CIS634-851

Contact the Instructor:  

·         In person: GITC 5502.  Office hours: Monday 11:30 - 13:00 and Friday 16:30 - 17:45.

·         By e-mail (fastest way to get a response): wu@njit.edu

·        By Phone: (973) 596 - 5285  

Outdated Catalog Description (from NJIT web site)

Information Retrieval 3 credits

Prerequisites: CIS 631. Covers the concepts and principles of information retrieval systems design. Techniques essential for building text databases, document processing systems, office automation systems, and other advanced information management systems.

Introduction (by the instructor)

Information retrieval (IR) is a fast-changing field concerning the representation, organization, storage, and retrieval of information items.  A broader definition of data types of information items includes text, numbers, multimedia and more, while a narrower definition includes only text.  The instructor recognizes the fact that there are courses offered for other types of retrievals, so this course will focus mostly on text retrieval.  Even so, there are still many topics to be covered.  

The importance of text retrieval is obvious.  Most business data is in text format.  However, most text is not as well organized as numerical data stored in commercial databases.  This, along with linguistic complexity, has caused the low performance of text retrieval.  To achieve high retrieval effectiveness, techniques such as automatic indexing, query expansion, local context analysis, information extraction, text mining, and many more have been developed to overcome problems in IR.  As an information professional, you should know how to use these techniques to organize, store, and retrieve text effectively and efficiently. 

This course is designed to address both the theory and the practice of IR.  It consists of two parts: an introduction to IR theories, and hands-on retrieval experience using a readily available system.  Topics newly added for Fall 2003: document warehousing and text mining.

Please note: office automation systems and advanced information management systems listed in the course catalog for CIS 634 are not covered in this course. 

Class Conduct and Attendance

This is a graduate course.  As an NJIT graduate student, you must follow the Institute's academic rules.  Please refer to the student handbook for details.  Specifically:

§         Academic dishonesty is not allowed and will be reported to the Dean of Student Services.  

§         All written assignments will be sent to www.turnitin.com, a plagiarism prevention system, for verification.

§         Late assignments will be penalized 25% a day.

§         You are required to check webboard announcements at least 3 times a week.   

§         When posting messages on the class web board, respect others and be considerate.

§         If you have to drop the class after you are assigned to a project group, please notify the instructor and your group members immediately.

§         Snow closing information will be available on the NJIT web site.  The instructor does not make the decision.

 

Textbooks

Required 1: Information Storage and Retrieval, by Robert R. Korfhage, Publisher: John Wiley & Sons (ISBN: 0471143383)

Required 2: Trailblazing a Path Towards Knowledge and Transformation, by HsinChun Chen.  It is available in its entirety at the author's web site: http://ai.bpa.arizona.edu/go/download/Chen2Book.pdf (a small booklet containing only 80 5"x7" pages).  Please finish reading it by the end of the 4th week.

Required 3: Knowledge Management Systems: A Text Mining Perspective, by HsinChun Chen.  It is available in its entirety at the author's web site: http://ai.bpa.arizona.edu/go/download/chenKMSi.pdf (a small booklet containing only 50 5"x7" pages).  Please finish reading it by the end of the 9th week.

Optional 1: Document Warehousing and Text Mining: Techniques for Improving Business Operations, Marketing, and Sales, by Dan Sullivan, Publisher: Wiley, 2001 (ISBN: 0471399590)

Optional 2: Information Retrieval (2nd Ed), by C. J. van Rijsbergen, Publisher: London: Butterworths, 1979.  It is available in its entirety at the author's web site: http://www.dcs.gla.ac.uk/Keith/Preface.html

Supplemental readings are available through the links on the course schedule below.

 

Assignments, Grading, and Due Dates (subject to change):

Note:  All written assignments should be word-processed and posted on the web board as WORD or PDF attachments.  Do not submit hard copies.  The group presentation slides should be posted as PowerPoint attachments.  Remember to put your name on every submission, but do not list your social security number!  Most importantly, all written assignments should follow a proper citation style!

1. Participation (attendance, self-intro, group competition, and webboard contributions)    10%

2. An Online Search Log                                                                     10%

3. Two Retrieval Experiments                                                                20%

4. Semester Project                                                                         30%

5. Final Exam                                                                               30%

   Total                                                                                   100%

Following is an overview of the assignments and projects; more details will be provided in class.  

1. Participation (10%): 

 

2. An Online Search Log (10%):  This assignment appears to be easy and straightforward.  However, if you carefully record the whole experience, you will run into many IR problems during the exercise.  The instructor will show you how this exercise relates to other IR topics later in the semester.

§        Part I: Search for and record any useful information for the task assigned.  There is no limit on the number of resources you can use, as long as they are web search engines, web directories, or electronic journal databases available at the NJIT library web site.  (The instructor might not have access to other resources.)

Task: Please find the necessary information on "how to set up a small network for three desktops running Windows XP at home?"  (Note: 1. The sentence describing the task should not be the query you use to search for information.  You should come up with your own queries.  Through this experience, you will learn why it is difficult for users to find information.  2. Even if you already know the answer, please pretend you do not and go on with the assignment.) 

For each search session, be sure to record the following items along the search process:

1.      The search engine/directory/electronic database you used 

2.      The search query you entered.  (If you entered several queries with the same search tool, treat them as different sessions.)

3.      Number of search hits returned.

4.      In the top twenty returned documents, find out the number of documents/hits actually relevant to your query and their URLs (or document titles, if an electronic database is used). 

5.      Among all returned hits, what is the number of returned documents you browsed through?  
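For orientation only (this is not part of the assignment), the items recorded for one session are enough to compute a simple effectiveness figure that comes up again later in the semester: precision within the top twenty hits, i.e. the share of the examined top-ranked documents that were actually relevant.  The sketch below is a minimal illustration in Python; the field names, the function name, and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SearchSession:
    """One search session as recorded in the log (hypothetical field names)."""
    tool: str                # item 1: search engine / directory / database
    query: str               # item 2: the query entered
    total_hits: int          # item 3: number of hits returned
    relevant_in_top_20: int  # item 4: relevant documents among the top twenty
    browsed: int             # item 5: number of returned documents browsed


def precision_at_20(session: SearchSession) -> float:
    """Fraction of the examined top-ranked hits (at most 20) that were relevant."""
    examined = min(20, session.total_hits)
    if examined == 0:
        return 0.0  # nothing was returned, so nothing could be relevant
    return session.relevant_in_top_20 / examined


# Made-up sample numbers, for illustration only.
example = SearchSession(
    tool="a web search engine",
    query="a query of your own",
    total_hits=1200,
    relevant_in_top_20=7,
    browsed=15,
)
print(precision_at_20(example))
```

Comparing this figure across sessions is one way to see which queries and which search tools served the task better.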

Note:

§         If you use more than one query for a particular search engine, please record items 2, 3, 4, and 5 for each query.

§         If you do not find useful information and decide to use other resources, please repeat the above steps.

§         Clearly mark the URL or title of the best site/best paper obtained from your search.   

§         Part II: Read "What Do People Want from IR" by Croft.  Based on your experience as a user of web search engines and/or text databases, write a short paper called "What Do I Need from IR?"  List at least 3 items.  You do not need to pick items from Croft's paper.  For each item, please provide examples of the frustrations you experienced in the first part of this assignment.  Limit your response to 500 words.

3. Retrieval Experiments (20%):  Note that each student will work individually and will use a unique set of queries; therefore, no two students' results will be identical.  Results of both retrieval experiments should be posted on your own web site.  Note: To gain access to the test collections and the IR system, all students should get an AFS account.  Please follow the instructions at http://newaccount.njit.edu/.  The instructor will NOT help you obtain the account.  Please direct all your questions to the Computing Help Desk, 973-596-2900.

 

Assigned queries for your RE I and II 

 

(will be filled out soon after the withdrawal deadline.)

 

Q1      Ahmad
Q2      Bushell
Q3      Chaar
Q4      Conover
Q5      The Instructor
Q6      Dougan
Q7      Hutchinson
Q8      Kadzielawa
Q9      Kamat
Q10     Karandikar
Q11     Kong
Q12     Moeller
Q13     Mount
Q14     Ojiem
Q15     Ruymann
Q16     Sequeira
Q17     Shah
Q18     Sharma
Q19     Tong
Q20     Tsai
Q24     Williams, J
Q25     Williams, M
Q26     Xie
Q27
Q28
Q29
Q30
Q31
Q32
Q33
Q34
Q35

 

§         RE I (10%): You are required to operate an IR system with a test collection and your assigned query, and then run several retrieval experiments using Arrow from the Bow toolkit (see the Resources section of the syllabus).  The results of your Retrieval Experiment 1 should look like this page: http://web.njit.edu/~wu/cis634/cis634re1.html 

§         RE II (10%): You are required to create a small document collection based on the results from your RE I, build a document-term matrix, and perform document classification/clustering analysis using Rainbow and Nenet (see Resources below).  The results of your Retrieval Experiment 2 should look like this page: http://web.njit.edu/~wu/cis634/cis634re2.html    

4. Semester Group Project (30%): Two options:  case analysis project or programming project.  Instructions here.

Option 1: Case analysis

You are required to make up a fictional client and its business problems.

§         Part I (3%): A proposal defining your client, its business problems, and at least 3 possible sources of documents. 

§         Part II (22%): Use text mining software programs to collect, pre-process, index, mine, and analyze the documents you collected.   Deliverables: 1. A CD containing all documents you collected; 2. PowerPoint slides; 3. A final report containing a. a 1-page executive summary, b. the main report: tools used and screen shots of outputs, c. analysis and recommended solutions to the client's business problems. 

§         Part III (5%): Presentation 

Option 2: Programming Project 

You can design your own project with the instructor's approval.  Sample choices are: text retrieval systems, information extraction systems, automatic summarizations, etc.   

§         Part I (5%): A proposal describing your project, including system functions and the tools that you will use to develop the system.  Please use flow charts to demonstrate your idea.

§         Part II (22%): A. System development using whichever programming language your group is most familiar with.  (You will have all the necessary IR concepts and theories from lectures.  The instructor will not spend time discussing the implementation.  For example, the instructor will discuss what an inverted file is and how it can be used for automatic indexing and retrieval, but not how to use Java or C++ to generate the inverted file.)  B. System design and documentation (including flow charts and user manual), and the evaluation of system performance. 

§         Part III (5%): Presentation
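For orientation, an inverted file of the kind mentioned under Part II can be pictured as a mapping from each term to the set of documents that contain it.  The following is a minimal sketch in Python, not a required design and not the approach taught in lectures; the sample documents, the whitespace tokenizer, and the function names are assumptions made purely for illustration.

```python
from collections import defaultdict


def build_inverted_file(docs):
    """Map each term to the set of document ids in which it occurs.

    `docs` is a dict of {doc_id: text}.  Tokenization here is just
    lowercasing and splitting on whitespace (an assumption; a real
    system would also handle punctuation, stop words, and stemming).
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search_all_terms(index, query):
    """Return the ids of documents containing every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Start from the posting set of the first term, then intersect.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical three-document collection.
docs = {
    1: "text retrieval systems",
    2: "text mining and document warehousing",
    3: "information retrieval",
}
index = build_inverted_file(docs)
print(search_all_terms(index, "text retrieval"))
```

The point of the structure is that a query only touches the posting sets of its own terms, rather than scanning every document in the collection.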

5. Take Home Exam (30%):  It covers all previous lectures and the two booklets written by Dr. HsinChun Chen. 

 

Resources  

1. from The Information Retrieval Group at the University of Glasgow

§         A list of stop words: http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words  

§         Test Collections: http://www.dcs.gla.ac.uk/idom/ir_resources/test_collections/ 

2. from McCallum, Andrew Kachites, Computer Science Dept, Carnegie Mellon University

§         "Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering," 1996.  http://www.cs.cmu.edu/~mccallum/bow.

§         Arrow help file

§         Crossbow help file

§         Rainbow help file

3. from Neural Networks Research Centre, HELSINKI UNIVERSITY OF TECHNOLOGY

§         SOM_PAK

§         Nenet

4. from AI Lab, University of Arizona

§         Web Spiders

Schedule (subject to change, last updated Aug/10/2003)

Week

Topic

Readings

Due Dates

1

Course Logistics and Overview

IR Academic Resources

Document and Query Forms

"What do people want from IR" by Croft  

Korfhage Ch1-2

Van Rijsbergen Ch1-2

Self-introduction on webboard due.  (Sept 6)

2

Data Compression

Query Structures 

Korfhage Ch2 - 3  

 

3

Matching Process

Text Analysis

Korfhage Ch 3-5

Rijsbergen Ch2 (The Zipf's Law Part only)

On-line search log due (Sept 20)

4

TF.IDF

Basics of UNIX

Experiencing an IR System

Retrieval Experiment I Instructions

Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering

Finish reading Dr. Chen's book "Trailblazing a Path Towards Knowledge and Transformation"

5

Document Similarity

Document and Concept Classification/Clustering Techniques

Overview of Final Projects

Rijsbergen Ch3

On the Automatic Classification of Accounting concepts: Preliminary Results of the Statistical Analysis of Term-Document Frequencies, by Gangolly and Wu, published in New Review of Applied Expert Systems and Emerging Technologies, pp 81-88, v 6, 2000.

Design competition: Part I System Designs Due (Oct 04)

6

Text Mining using Neural Networks

SOM and Nenet package overview

Retrieval Experiment II Instructions

AI LAB: A Scalable Self-Organizing Map Algorithm for Textual Classification: A Neural Network Approach to Automatic Thesaurus Generation (Roussinov & Chen, 1998)

Design competition: Part II Votes Due (Oct 11)

Retrieval Exp I due (Oct 11)

7

Document Warehousing

Sullivan Ch1

Semester Project Proposal due (Oct 18)

8

Information Extraction

Text Mining Applications

Sullivan Ch13

1. Information Extraction: Techniques and Challenges, by Ralph Grishman

2. An interactive system for finding complementary literatures: a stimulus to scientific discovery, Artificial Intelligence, Volume 91, Issue 2, April 1997, Pages 183-203 Don R. Swanson and Neil R. Smalheiser

3. Combining Data and Text Mining Techniques For Analyzing Financial Reports

Retrieval Exp II due (Oct 25)

9

Retrieval Effectiveness Measures

Output Presentation

Korfhage Ch 8, 11

Rijsbergen Ch 7 

Finish reading Dr. Chen's book "Knowledge Management Systems: A Text Mining Perspective"

10

IR Effectiveness Improvement Techniques: Relevance Feedback, Query Expansion, Local Context Analysis, and Word-Sense Disambiguation

Korfhage Ch 9

 

11

User Profiles

Alternative Retrieval Techniques

Korfhage Ch 6, 10

S. Chakrabarti, B. Dom and P. Indyk. Enhanced hypertext categorization using hyperlinks. Proceedings of ACM SIGMOD 1998.  http://www.cs.toronto.edu/~wtjioe/mining/hypertext.pdf

J. Kleinberg. Authoritative sources in a hyperlinked environment.
http://www.cs.cornell.edu/home/kleinber/auth.pdf

 

12

Final Exam covering all previous lectures and the 2 booklets by Dr. HsinChun Chen

The exam will be e-mailed to you and also posted on the webboard by 5pm Nov/21.  It is due back via e-mail to Prof Wu by 12pm Nov/22, or it must be postmarked Nov/22.

 

13

Natural Language Processing

Little Words Can Make a Big Difference for Text Classification, by Ellen Riloff

14

Semester Project Presentation Part I: Questions

Each group, after reviewing the other groups' work, has to ask 1 question of each of the remaining groups by replying to those groups' final project webboard messages.  (For example, if there are 7 groups, your group has to generate 6 questions.)  The questions have to be chosen very carefully so that they point out problems in another group's work or things that are not clear to you.

All Semester Projects and Presentation Slides DEADLINE: midnight Dec 4.

Questions for other groups are due: Dec 10

15

Semester Project Presentation Part II: Answers 

Each group has to answer the questions addressed to it by Dec/13 (Sat).  However, you should answer them as soon as they are posted.   Divide the work among your group members.  The quality of your responses is important.

Answers are due Dec/13.