CIS 634 Information Retrieval

Syllabus for Distance Learning Section

Fall 2004

(subject to change)

Instructor for the DL section and Course Coordinator: Yi-Fang Wu, Ph.D. (a.k.a. Brook Wu)

Instructor for the FTF section: Mr. Quanzhi Li

Face-to-Face Class Location & Time: KUPF 104, 6:00pm-9:05pm, Wednesday

Contact the Instructors:

for Dr. Wu:

By e-mail (fastest way to get a response): wu@njit.edu

for Quanzhi Li:

In person: GITC 4215 (eARTH Lab), Hours: 4:00pm-6:00pm, Monday, and 3:00pm-5:30pm, Wednesday
By e-mail: QL23@njit.edu
By Phone: (973) 596 - 5655

Class Web Board: http://webboard.njit.edu:8080/~F2004CIS634-101,851/

Outdated Catalog Description (from NJIT web site)

Information Retrieval 3 credits

Prerequisites: CIS 631. Covers the concepts and principles of information retrieval systems design. Techniques essential for building text databases, document processing systems, office automation systems, and other advanced information management systems.

Introduction (by the instructor)

Information retrieval (IR) is a fast-changing field concerned with the representation, organization, storage, and retrieval of information items. Defined broadly, information items include text, numbers, multimedia, and more; defined narrowly, they include only text. Because other courses cover the retrieval of other data types, this course focuses mostly on text retrieval. Even so, there are still many topics to cover.

Text retrieval matters because most business data is in text format. However, most text is not as well organized as the numerical data stored in commercial databases. This, along with linguistic complexity, has kept the performance of text retrieval low. To improve retrieval effectiveness, techniques such as automatic indexing, query expansion, local context analysis, information extraction, text mining, and many more have been developed. As an information professional, you should know how to use these techniques to organize, store, and retrieve text effectively and efficiently.

This course addresses both the theory and the practice of IR. It consists of two parts: an introduction to IR theories, and hands-on retrieval experience using a readily available system. Topics newly added in Fall 2003: document warehousing and text mining.

Please note: office automation systems and advanced information management systems listed in the course catalog for CIS 634 are not covered in this course.

Class Conduct and Attendance

This is a graduate course. As an NJIT graduate student, you must follow the Institute's academic rules. Please refer to the student handbook for details. Specifically:

§ Academic dishonesty is not allowed and will be reported to the Dean of Student Services.

§ All written assignments will be sent to http://www.turnitin.com/, a plagiarism prevention system, for verification.

§ Late assignments will be penalized 25% a day.

§ You are required to check webboard announcements at least 3 times a week.

§ When posting messages on the class web board, respect others and be considerate.

§ You are expected to attend all class meetings ON TIME. Class attendance will be recorded every meeting. Poor attendance will negatively affect your semester grade.

§ If you have to drop the class after you are assigned to a project group, please notify the instructor and your group members immediately.

§ Snow closing information will be available on the NJIT web site. The instructor does not make the decision.

Textbooks

Required 1: Information Storage and Retrieval, by Robert R. Korfhage, Publisher: John Wiley & Sons (ISBN: 0471143383)

Required 2: Trailblazing a Path Towards Knowledge and Transformation, by HsinChun Chen. It is available in its entirety at the author's web site: http://ai.bpa.arizona.edu/go/download/Chen2Book.pdf (a small booklet of only 80 5"x7" pages). Please finish reading it by the end of the 4th week.

Required 3: Knowledge Management Systems: A Text Mining Perspective, by HsinChun Chen. It is available in its entirety at the author's web site: http://ai.bpa.arizona.edu/go/download/chenKMSi.pdf (a small booklet of only 50 5"x7" pages). Please finish reading it by the end of the 9th week.

Optional 1: Document Warehousing and Text Mining: Techniques for Improving Business Operations, Marketing, and Sales, by Dan Sullivan, Publisher: Wiley, 2001 (ISBN: 0471399590)

Optional 2: Information Retrieval (2nd Ed.), by C. J. van Rijsbergen, Publisher: London: Butterworths, 1979. It is available in its entirety at the author's web site: http://www.dcs.gla.ac.uk/Keith/Preface.html

Supplemental readings are available through the links on the course schedule below.

Assignments, Grading, and Due Dates (subject to change):

Note: All written assignments should be word-processed and posted on the web board as Word or PDF attachments. Do not submit hard copies. The group presentation slides should be posted as PowerPoint attachments. Remember to put your name on each submission, but do not list your Social Security number! Most importantly, all written assignments should follow proper citation style!

1. Participation (attendance, self-intro, group competition, in-class and webboard contributions) 10%

Self-introduction and reply to one other student's intro (1%) due midnight, Sept 11.
IR system design competition (4%) Design due: Oct/02. Votes due: Oct/09. See detailed instructions below.
In-class and webboard participation (5%)

2. An Online Search Log 5%

due: midnight Sept 18.

3. Two Retrieval Experiments 20%

RE I (10%): due midnight Oct 09
RE II (10%): due midnight Oct 30

4. Semester Project 35%

Proposal (5%): due midnight October 16.
Full paper (25%) or (for programming projects: the developed system 20% and documentation 5%): due midnight Dec/07
Presentation (5%): December 08.

5. Final Exam 30%

Total 100%

The following is an overview of the assignments and projects; more details will be provided in class.

1. Participation (10%):

Self-introduction on the web board. Please follow the instructions in the "Introductions" conference on the class webboard. Reply to and get to know at least one of your classmates. (1%)
IR System Design competition. (4%)

This activity consists of 2 parts: design and voting.

Design Part:

Instructions: You might want to use Visio or MS Word to draw the flow charts. Remember to attach files.
One very important thing to remember: always think about how large the web is and how complicated it is to generate an index for web documents. A search engine works as follows:

1. document collection: use spiders or crawlers to collect documents and gather basic info such as URL, title, last updated date, etc.
2. indexing: parse the web documents to get a list of unique terms and, from it, the final index. This takes several steps: stop-word removal, stemming, and removal of very high- and very low-frequency words (based on Zipf's law).

Steps 1 and 2 must both be completed for the whole document collection before users can use the search system.

3. search interface: accepts queries from users.

4. retrieval: the system compares the query with index for each document in the document collection.

4-a. simple Boolean queries: the retrieval component checks the presence and/or absence of the query terms in the document index.
4-b. similarity using a distance- or angle-based method: a calculation between the query vector and every document vector is required.
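
As a rough illustration of steps 2 and 4 above, the following Python sketch builds a small inverted index and term-frequency vectors, then answers a Boolean AND query (4-a) and ranks documents by cosine similarity (4-b). It is a minimal sketch under stated assumptions, not a required design: the three sample documents, the tiny stop list, and all function names are made up for illustration, and stemming and Zipf-based frequency cut-offs are omitted.

```python
import math
import re
from collections import Counter, defaultdict

# Tiny illustrative stop list (an assumption; real systems use much longer lists).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on"}


def tokenize(text):
    """Lowercase the text, split on non-letters, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def build_index(docs):
    """Return (inverted index, per-document term-frequency vectors)."""
    inverted = defaultdict(set)  # term -> set of ids of documents containing it
    vectors = {}                 # doc id -> Counter of term frequencies
    for doc_id, text in docs.items():
        terms = tokenize(text)
        vectors[doc_id] = Counter(terms)
        for term in terms:
            inverted[term].add(doc_id)
    return inverted, vectors


def boolean_and(inverted, query):
    """Step 4-a: ids of documents that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(inverted.get(terms[0], set()))
    for term in terms[1:]:
        result &= inverted.get(term, set())
    return result


def cosine(u, v):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(vectors, query):
    """Step 4-b: all documents sorted by cosine similarity to the query."""
    q = Counter(tokenize(query))
    scores = [(doc_id, cosine(q, vec)) for doc_id, vec in vectors.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical three-document collection.
docs = {
    "d1": "Setting up a home network for Windows desktops",
    "d2": "Neural networks for text mining",
    "d3": "Home network security and wireless routers",
}
inverted, vectors = build_index(docs)
print(boolean_and(inverted, "home network"))
print(rank(vectors, "home network"))
```

Note that the query term "network" does not match "networks" in d2, because this sketch does no stemming. Note also that the ranking step touches every document vector, which is why 4-b is more expensive than the index lookups in 4-a.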

Voting Part:

For voting, please select the 3 best designs from those submitted by DL students. The most important criterion is whether the design works. Other factors, such as efficiency, are used to rank the best designs.

In-class participation and Webboard participation: The instructor will post topics for on-line discussion on the web board. Please respond to them. You are also welcome to offer your point of view on other students' questions. Respect others and be considerate when responding to postings! "Good participation" means positive and constructive postings, not a large number of postings. (5%)

2. An Online Search Log (5%): This assignment appears easy and straightforward. However, if you carefully record the whole experience, you will find many IR problems along the way. The instructor will show you how this exercise relates to other IR topics later in the semester.

§ Part I: Search for and record any useful information for the assigned task. There is no limit on the number of resources you can use, as long as they are web search engines, web directories, or electronic journal databases available on the NJIT library web site. (The instructor might not have access to other resources.)

Task: Please find the necessary information on "how to set up a small network for three desktops running Windows XP at home?" (Note: 1. Do not use the sentence describing the task as your search query. Come up with your own queries. Through this experience, you will learn why it is difficult for users to find information. 2. Even if you already know the answer, please pretend you do not and go on with the assignment.)

For each search session, be sure to record the following items along the search process:

1. The search engine/directory/electronic database you used

2. The search query you entered. (If you entered several queries with the same search tool, treat them as different sessions.)

3. Number of search hits returned.

4. In the top twenty returned documents, find out the number of documents/hits actually relevant to your query and their URLs (or document titles, if an electronic database is used).

5. Among all returned hits, what is the number of returned documents you browsed through?
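
Item 4 corresponds to what IR evaluation commonly calls precision at 20: the fraction of the top twenty hits that are relevant. The short Python sketch below shows the arithmetic. The function name and the sample relevance judgments are hypothetical, and dividing by k even when fewer than k hits are returned is one common convention, assumed here.

```python
def precision_at_k(relevant_flags, k=20):
    """Fraction of the top-k hits judged relevant.

    relevant_flags: booleans in rank order, True where the hit at that
    rank was judged relevant to the query.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top = relevant_flags[:k]
    # Divide by k (not len(top)) so that short result lists are not rewarded.
    return sum(top) / k


# Hypothetical session: 5 of the top 20 hits were judged relevant.
judgments = [True, False, True, False, False] * 4 + [True] * 10
judgments = [True, True, False, True, False, False, True, False, False, False,
             False, True, False, False, False, False, False, False, False, False]
print(precision_at_k(judgments))  # 5 relevant out of 20
```

Recording this number for each session makes it easy to compare search tools and queries later in the semester.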

Note:

§ If you use more than one query for a particular search engine, please record items 2, 3, 4, and 5 for each query.

§ If you do not find useful information and decide to use other resources, please repeat the above steps.

§ Clearly mark the URL or title of the best site/best paper obtained from your search.

§ Part II: Read "What Do People Want from IR" by Croft. Based on your experience as a user of web search engines and/or text databases, write a short paper called "What Do I Need from IR?" List at least 3 items. You do not need to pick items from Croft's paper. For each item, please provide examples of the frustrations you experienced in the first part of this assignment. Limit your response to 500 words.

3. Retrieval Experiments (20%): Note that each student will work individually and will use a unique set of queries; therefore, none of the results of your experiments will be identical. Results of both retrieval experiments should be posted on your own web site. Note: To gain access to the test collections and the IR system, all students should get an AFS account. Please follow instructions on http://newaccount.njit.edu/. The instructor will NOT help you to obtain the account. Please direct all your questions to Computing Help Desk 973-596-2900.

Assigned queries for your RE I and II

(To be filled in soon after the withdrawal deadline.)

Q1 Jimin R. Bhuptani

Q2 Keerti K. Chivakula

Q3 Paul L. Cihak

Q4 Arun Dabas

Q5 Dr. Wu

Q6 Mircea Dascaloiu

Q7 Thomas J. German

Q8 Mythili Jammalamadaka

Q9 Latha Kalidindi

Q10 Heather L. Kile

Q11 Mineshkumar D. Lad

Q12 Derek K. Linebarger

Q13 Tejal B. Mistry

Q14 Syed Salman Mohsin

Q15 Prerak J. Parikh

Q16 Mayur R. Patel

Q17 Paavan A. Pujara

Q18 William Michael Rosellini

Q19 Heny Shah

Q20 Jeffrey L. Spector

Q24 Sandhya Srinivasan

Q25 Richard Wang

Q26

Q27

Q28

Q29

Q30

Q31

Q32

Q33

Q34

Q35

§ RE I (10%): You will work with an IR system, a test collection, and your assigned query, and run several retrieval experiments using Arrow from the Bow toolkit (see the Resources section of the syllabus). The results of your Retrieval Experiment 1 should look like this page: http://www-ec.njit.edu/~wu/cis634/cis634re1.html

§ RE II (10%): You will create a small document collection based on the results of your RE I, build a document-term matrix, and perform document classification/clustering analysis using Rainbow and Nenet (see Resources below). The results of your Retrieval Experiment 2 should look like this page: http://web.njit.edu/~wu/cis634/cis634re2.html
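
For readers unfamiliar with the term, a document-term matrix has one row per document and one column per vocabulary term, with each cell holding how often the term occurs in the document. The Python sketch below only illustrates the idea; it is not the file format or procedure used by Rainbow or Nenet, and the function name and two sample documents are hypothetical.

```python
from collections import Counter


def document_term_matrix(docs):
    """Return (vocab, matrix), where matrix[i][j] counts vocab[j] in docs[i]."""
    # Whitespace tokenization only; no stop-word removal or stemming here.
    counts = [Counter(text.lower().split()) for text in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix


# Hypothetical two-document collection.
docs = ["text mining text", "neural network mining"]
vocab, matrix = document_term_matrix(docs)
print(vocab)
for row in matrix:
    print(row)
```

Each row can then be treated as a vector, which is the input that classification and clustering methods generally work from.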

4. Semester Group Project (35%): Two options: case analysis project or programming project. Instructions here.

Option 1: Case analysis:

You are required to make up a fictional client and its business problems.

§ Part I (5%): A proposal defining your client, its business problems, and at least 3 possible sources of documents.

§ Part II (25%): Use text mining software to collect, pre-process, index, mine, and analyze the documents you collected. Deliverables: 1. A CD containing all documents you collected; 2. PowerPoint slides; 3. A final report containing a. a 1-page executive summary, b. the main report: tools used and screen shots of outputs, c. analysis and recommended solutions to the client's business problems.

§ Part III (5%): Presentation

Option 2: Programming Project

You can design your own project with the instructor's approval. Sample choices are: text retrieval systems, information extraction systems, automatic summarizations, etc.

§ Part I (5%): A proposal describing your project, including systems functions and tools that you will use to develop the system. Please use flow charts to demonstrate your idea.

§ Part II (25%): A. System development using whichever programming language your group is most familiar with. (You will get all the necessary IR concepts and theories from lectures. The instructor will not spend time discussing implementation. For example, the instructor will discuss what an inverted file is and how it can be used for automatic indexing and retrieval, but not how to use Java or C++ to generate the inverted file.) B. System design and documentation (including flow charts and a user manual), and the evaluation of system performance.

§ Part III (5%): Presentation

5. Final Exam (30%): Covers all previous lectures and the two booklets written by Dr. HsinChun Chen.

Resources

1. from The Information Retrieval Group at the University of Glasgow:

§ A list of stop words: http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words

§ Test Collections: http://www.dcs.gla.ac.uk/idom/ir_resources/test_collections/

2. from McCallum, Andrew Kachites, Computer Science Dept, Carnegie Mellon University

§ "Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering," 1996. http://www.cs.cmu.edu/~mccallum/bow.

§ Arrow help file

§ Crossbow help file

§ Rainbow help file

3. from Neural Networks Research Centre, Helsinki University of Technology

§ SOM_PAK

§ Nenet

4. from AI Lab, University of Arizona

§ Web Spiders

Schedule (subject to change, last updated August/27/2004)

Week: 09/01
Topics: Course Logistics and Overview; IR Academic Resources; Document and Query Forms
Readings: "What Do People Want from IR" by Croft; Korfhage Ch 1-2; van Rijsbergen Ch 1-2

Week: 09/08
Topics: Data Compression; Query Structures
Readings: Korfhage Ch 2-3
Due: Self-introduction on webboard (Sept 11)

Week: 09/15
Topics: Matching Process; Text Analysis
Readings: Korfhage Ch 3-5; van Rijsbergen Ch 2 (the Zipf's law part only)
Due: On-line search log (Sept 18)

Week: 09/22
Topics: TF.IDF; Basics of UNIX; Experiencing an IR System; Retrieval Experiment I Instructions
Readings: Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering
Due: Finish reading Dr. Chen's book "Trailblazing a Path Towards Knowledge and Transformation"

Week: 09/29
Topics: Document Similarity; Document and Concept Classification/Clustering Techniques; Overview of Final Projects
Readings:
1. van Rijsbergen Ch 3
2. On the Automatic Classification of Accounting concepts: Preliminary Results of the Statistical Analysis of Term-Document Frequencies, by Gangolly and Wu, published in New Review of Applied Expert Systems and Emerging Technologies, pp 81-88, v 6, 2000.
Due: IR system design competition (Oct 02)

Week: 10/06
Topics: Text Mining using Neural Networks; SOM and Nenet package overview; Retrieval Experiment II Instructions
Readings: AI LAB: A Scalable Self-Organizing Map Algorithm for Textual Classification: A Neural Network Approach to Automatic Thesaurus Generation (Roussinov & Chen, 1998)
Due (both on Oct 09): 1. Retrieval Exp I, and 2. Votes for best IR system design

Week: 10/13
Topics: Document Warehousing
Readings: Sullivan Ch 1

Week: 10/20
Topics: Information Extraction; Text Mining Applications
Readings:
1. Sullivan Ch 13
2. Information Extraction: Techniques and Challenges, by Ralph Grishman
3. An interactive system for finding complementary literatures: a stimulus to scientific discovery, Artificial Intelligence, Volume 91, Issue 2, April 1997, Pages 183-203, Don R. Swanson and Neil R. Smalheiser
4. Combining Data and Text Mining Techniques For Analyzing Financial Reports
Due: Semester Project Proposal (Oct 16)

Week: 10/27
Topics: Retrieval Effectiveness Measures; Output Presentation
Readings: Korfhage Ch 8, 11; van Rijsbergen Ch 7
Due: Retrieval Exp II (Oct 30); finish reading Dr. Chen's book "Knowledge Management Systems: A Text Mining Perspective"

Week: 11/03
Topics: IR Effectiveness Improvement Techniques: Relevance Feedback, Query Expansion, Local Context Analysis, and Word-Sense Disambiguation
Readings: Korfhage Ch 9

Week: 11/10
Topics: User Profiles; Alternative Retrieval Techniques
Readings:
1. Korfhage Ch 6, 10
2. S. Chakrabarti, B. Dom and P. Indyk. Enhanced hypertext categorization using hyperlinks. Proceedings of ACM SIGMOD 1998. http://www.cs.toronto.edu/~wtjioe/mining/hypertext.pdf
3. J. Kleinberg. Authoritative sources in a hyperlinked environment. http://www.cs.cornell.edu/home/kleinber/auth.pdf

Week: 11/17
Topics: Natural Language Processing
Readings: Little Words Can Make a Big Difference for Text Classification, by Ellen Riloff

Week: 11/24
Thanksgiving. No class.

Week: 12/01
Topics: Final Exam, covering all previous lectures and the 2 booklets by Dr. HsinChun Chen

Week: 12/08
Topics: Semester Project Presentation
Due: Final Project (midnight Dec 07)
