CIS 634 INFORMATION RETRIEVAL
Syllabus for Distance Learning Section
Fall 2004
(subject to change)
Instructor for the DL section and Course Coordinator: Yi-Fang Wu, Ph.D.
Instructor for the FTF section: Mr. Quanzhi Li
Face-to-Face Class Location & Time: KUPF 104, 6:00pm-9:05pm, Wednesday
Contact the Instructors:
for Dr. Wu:
for Quanzhi Li:
By e-mail: QL23@njit.edu
By Phone: (973) 596 - 5655
Class Web Board: http://webboard.njit.edu:8080/~F2004CIS634-101,851/
Outdated Catalog Description (from NJIT web site)
Prerequisites: CIS 631. Covers the concepts and principles of information retrieval systems design. Techniques essential for building text databases, document processing systems, office automation systems, and other advanced information management systems.
Introduction
Information retrieval (IR) is a fast-changing field concerning the
representation, organization, storage, and retrieval of information items. A broader
definition of data types of information items includes text,
numbers, multimedia and more, while a narrower definition includes only
text. The instructor recognizes that other courses cover other types of
retrieval, so this course will focus mostly on text retrieval. Even so,
there are still many topics to be covered.
The importance of text retrieval is obvious. Most business data is in
text format. However, most text is not as well organized as the numerical data
stored in commercial databases. This, along with linguistic complexity,
has kept the performance of text retrieval low. To achieve high
retrieval effectiveness, techniques such as automatic indexing, query expansion,
local context analysis, information extraction, text mining, and many more have
been developed. As an information professional, you should know how to use
these techniques to organize, store, and retrieve text effectively and
efficiently.
This course is designed to address both the theory and the practice of IR.
It consists of two parts: an introduction to IR theories, and hands-on
experience of retrieval using a readily available system. Topics newly added
in Fall 2003: document warehousing and text mining.
Please note: office automation systems and advanced information management
systems listed in the course catalog for CIS 634 are not covered in this
course.
Class Conduct
This is a graduate course. As an NJIT graduate student, you must follow the Institute's academic rules. Please refer to the student handbook for details. Specifically:
§ Academic dishonesty is not allowed and will be reported to the Dean of Student Services.
§ All written assignments will be sent to http://www.turnitin.com/, a plagiarism prevention system, for verification.
§ Late assignments will be penalized 25% a day.
§ You are required to check webboard announcements at least 3 times a week.
§ When posting messages on the class web board, respect others and be considerate.
§ You are expected to attend all class meetings ON TIME. Class attendance will be recorded every meeting. Poor attendance will negatively affect your semester grade.
§ If you have to drop the class after you are assigned to a project group, please notify the instructor and your group members immediately.
§ Snow closing information will be available on the NJIT web site. The instructor does not make the decision.
Textbooks
Required 1: Information Storage and Retrieval, by Robert R. Korfhage, Publisher: John Wiley & Sons (ISBN: 0471143383)
Required 2: Trailblazing a Path Towards Knowledge and Transformation, by HsinChun Chen. It is available in its entirety at the author's web site: http://ai.bpa.arizona.edu/go/download/Chen2Book.pdf (a small booklet containing only 80 5"x7" pages). Please finish reading it by the end of the 4th week.
Required 3: Knowledge Management Systems: A Text Mining Perspective, by HsinChun Chen. It is available in its entirety at the author's web site: http://ai.bpa.arizona.edu/go/download/chenKMSi.pdf (a small booklet containing only 50 5"x7" pages). Please finish reading it by the end of the 9th week.
Optional 1: Document Warehousing and Text Mining: Techniques for Improving Business Operations, Marketing, and Sales, by Dan Sullivan, Publisher: Wiley, 2001 (ISBN: 0471399590)
Optional 2: Information Retrieval (2nd Ed), by C. J. van Rijsbergen, Publisher: London: Butterworths, 1979. It is available in its entirety at the author's web site: http://www.dcs.gla.ac.uk/Keith/Preface.html
Supplemental readings are available through the links on the course schedule below.
Assignments, Grading, and Due Dates (subject to change):
Note: All written assignments should be word-processed and posted on web board as WORD or PDF attachments. Do not submit hard-copies. The group presentation slides should be posted as PowerPoint attachments. Remember to specify your name on it, but do not list your social security number!! Most importantly, all written assignments should follow proper citation style!
1. Participation (attendance, self-intro, group competition, in-class and webboard contributions) 10%
Self-introduction and reply to one other student's intro (1%) due midnight, Sept 11.
IR system design competition (4%) Design due: Oct/02. Votes due: Oct/09. See detailed instructions below.
In-class and webboard participation (5%)
2. An Online Search Log 5%
due: midnight Sept 18.
3. Two Retrieval Experiments 20 %
RE I (10%): due midnight Oct 09
RE II (10%): due midnight Oct 30
4. Semester Project 35%
Proposal (5%): due midnight October 16.
Full paper (25%) or (for programming projects: the developed system 20% and documentation 5%): due midnight Dec/07
Presentation (5%): December 08.
5. Final Exam 30%
Total 100%
The following is an overview of the assignments and projects; more details will be provided in class.
1. Participation (10%):
Self-introduction on the web board: Please follow the instructions in the "Introductions" conference on the class webboard. Reply to, and get to know, at least one of your classmates. (1%)
IR System Design competition.(4%)
This activity consists of 2 parts: design and voting.
Design Part:
Instructions: You might want to use Visio or MS Word to draw the flow charts. Remember to attach your files.
Remember one very important thing: always keep in mind how large the web is and how complicated it is to generate an index for web documents. A search engine therefore works as follows:
1. Document collection: use spiders or crawlers to collect documents and gather basic information such as URL, title, last-updated date, etc.
2. Indexing: parse the web documents to get a list of unique terms and, from it, the final index. This takes several steps: stop-word removal, stemming, and removal of high- and low-frequency words (following Zipf's law).
Steps 1 and 2 have to be completed for the whole document collection before users can use the search system.
3. Search interface: accepts queries from users.
4. Retrieval: the system compares the query with the index for each document in the document collection.
4-a. Simple Boolean queries: the retrieval component checks the presence and/or absence of the query terms in the document index.
4-b. Similarity using a distance- or angle-based method: a calculation between the query vector and every document vector is required.
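The indexing and retrieval steps above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not a required design: the stop list, the crude suffix-stripping `stem` function (a stand-in for a real stemmer), the document-frequency cutoffs `min_df` and `max_df_ratio`, and all function names are hypothetical choices made for this example, and the crawling step is replaced by a small in-memory dictionary of documents.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical stop list for illustration; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "are", "is", "of", "the", "to", "in", "for"}


def tokenize(text):
    """Lowercase the text, keep runs of letters, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def stem(term):
    """Crude suffix stripping, a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "es", "s"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term


def build_index(docs, min_df=1, max_df_ratio=1.0):
    """Build an inverted index: term -> {doc_id: term frequency}.

    Terms whose document frequency falls outside
    [min_df, max_df_ratio * number of documents] are dropped, which
    imitates the removal of very rare and very common words.
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(stem(t) for t in tokenize(text)).items():
            index[term][doc_id] = tf
    n_docs = len(docs)
    return {
        term: postings
        for term, postings in index.items()
        if min_df <= len(postings) <= max_df_ratio * n_docs
    }


def boolean_and(index, query):
    """Step 4-a: return the ids of documents containing every query term."""
    terms = [stem(t) for t in tokenize(query)]
    if not terms:
        return set()
    result = set(index.get(terms[0], {}))
    for term in terms[1:]:
        result &= set(index.get(term, {}))
    return result


def cosine_rank(index, query):
    """Step 4-b: rank documents by cosine similarity of raw term-frequency vectors."""
    q_vec = Counter(stem(t) for t in tokenize(query))
    q_vec = {t: w for t, w in q_vec.items() if t in index}
    q_norm = math.sqrt(sum(w * w for w in q_vec.values()))
    # Squared length of each document vector, accumulated from the postings.
    doc_norms = defaultdict(float)
    for postings in index.values():
        for doc_id, tf in postings.items():
            doc_norms[doc_id] += tf * tf
    # Dot product between the query and each document sharing a query term.
    scores = defaultdict(float)
    for term, q_w in q_vec.items():
        for doc_id, tf in index[term].items():
            scores[doc_id] += q_w * tf
    ranked = [
        (doc_id, dot / (q_norm * math.sqrt(doc_norms[doc_id])))
        for doc_id, dot in scores.items()
    ]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    docs = {
        "d1": "Indexing the web with crawlers",
        "d2": "Boolean queries and indexing",
        "d3": "Crawlers collect web documents",
    }
    index = build_index(docs)
    print(boolean_and(index, "web crawlers"))
    print(cosine_rank(index, "boolean indexing"))
```

Note that the index is built once for the whole collection, while each query only touches the postings of its own terms. The Boolean search needs only set intersections, whereas the similarity ranking also needs the length of every document vector, which is one reason a real system would precompute those lengths at indexing time.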
Voting Part:
For voting, please select the 3 best designs from those submitted by DL students. The most important criterion is whether the design works. Other factors, such as efficiency, are used to rank the best designs.
In-class participation and webboard participation: The instructor will post topics for on-line discussion on the web board. Please respond to them. You are also welcome to offer your point of view on other students' questions. Respect others and be considerate when responding to postings! "Good participation" means positive and constructive postings, not a large number of postings. (5%)
2. An Online Search Log (5%):
§ Part I: Search for and record any useful information for the assigned task. There are no limitations on the number of resources you can use, as long as they are web search engines, web directories, or electronic journal databases available at the NJIT library web site. (The instructor might not have access to other resources.)
Task: Please find the necessary information on "how to set up a small network for three desktops running Windows XP at home?" (Note: 1. The sentence describing the task should not be the query you use to search for information. You should come up with your own queries. Through this experience, you will learn why it is difficult for users to find information. 2. Even if you know the answer without needing to search for information, please pretend you do not know it and go on with the assignment.)
For each search session, be sure to record the following items along the search process:
1. The search engine/directory/electronic database you used
2. The search query you entered. (If you entered several queries with the same search tool, treat them as different sessions.)
3. Number of search hits returned.
4. In the top twenty returned documents, find out the number of documents/hits actually relevant to your query and their URLs (or document titles, if an electronic database is used).
5. Among all returned hits, what is the number of returned documents you browsed through?
Note:
§ If you use more than one query for a particular search engine, please record items 2, 3, 4, and 5 for each query.
§ If you do not find useful information and decide to use other resources, please repeat the above steps.
§ Clearly mark the URL or title of the best site/best paper obtained from your search.
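One possible way to keep the log consistent is to record every search session in the same structure. The sketch below is only an illustration: the class name `SearchSession`, its field names, and the sample values are hypothetical, and the assignment does not require any particular format. Item 4 also lends itself to a simple summary figure, the fraction of examined top-twenty hits that were relevant (often called precision at 20); here it is assumed that the denominator is the number of hits actually available in the top twenty.

```python
from dataclasses import dataclass, field


@dataclass
class SearchSession:
    """One search session, mirroring the five items listed above."""

    tool: str                      # item 1: engine, directory, or database used
    query: str                     # item 2: the query as entered
    hits_returned: int             # item 3: number of hits reported
    relevant_in_top20: list = field(default_factory=list)  # item 4: URLs or titles
    documents_browsed: int = 0     # item 5: how many hits were actually opened

    def precision_at_20(self):
        """Fraction of the examined top-twenty hits that were relevant."""
        examined = min(20, self.hits_returned)
        if examined == 0:
            return 0.0
        return len(self.relevant_in_top20) / examined


if __name__ == "__main__":
    # Made-up values, purely to show how a record would be filled in.
    session = SearchSession(
        tool="ExampleEngine",
        query="home network windows xp",
        hits_returned=15000,
        relevant_in_top20=["url-1", "url-2", "url-3", "url-4", "url-5"],
        documents_browsed=8,
    )
    print(session.precision_at_20())
```

Keeping one record per query, rather than per search tool, matches the rule that several queries on the same tool count as different sessions.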
§ Part II: Read "What Do People Want from IR" by Croft. Based on your experience as a user of web search engines and/or text databases, write a
3. Retrieval Experiments (20%):
Assigned queries for your RE I and II
(to be filled in soon after the withdrawal deadline)
Q1 Jimin R. Bhuptani
Q2 Keerti K. Chivakula
Q3 Paul L. Cihak
Q4 Arun Dabas
Q5 Dr. Wu
Q6 Mircea Dascaloiu
Q7 Thomas J. German
Q8 Mythili Jammalamadaka
Q9 Latha Kalidindi
Q10 Heather L. Kile
Q11 Mineshkumar D. Lad
Q12 Derek K. Linebarger
Q13 Tejal B. Mistry
Q14 Syed Salman Mohsin
Q15 Prerak J. Parikh
Q16 Mayur R. Patel
Q17 Paavan A. Pujara
Q18 William Michael Rosellini
Q19 Heny Shah
Q20 Jeffrey L. Spector
Q24 Sandhya Srinivasan
Q25 Richard Wang
Q26 Q27 Q28 Q29 Q30 Q31 Q32 Q33 Q34 Q35
§ RE I (10%): You are required to work with an IR system, a test collection, and a query, and then run several retrieval experiments using Arrow, part of the Bow toolkit (see the Resources section of the syllabus).
§ RE II (10%): You are required to create a small document collection based on the results from your RE I, build a document-term matrix, and perform document classification/clustering analysis using Rainbow and Nenet (see Resources below). The
4. Semester Group Project (35%): Two options: case analysis project or programming project. Instructions here.
Option 1: Case analysis:
You are required to make up a fictional client and its business problems.
§ Part I (5%): A proposal defining your client, its business problems, and at least 3 possible sources of documents.
§ Part II (25%): Use text mining software programs to collect, pre-process, index, mine, and analyze the documents you collected. Deliverables: 1. A CD containing all documents you collected; 2. PowerPoint slides; 3. A final report containing a. a 1-page executive summary, b. the main report: tools used and screen shots of outputs, c. analysis and recommended solutions to the client's business problems.
§ Part III (5%): Presentation
Option 2: Programming Project
You can design your own project with the instructor's approval. Sample choices are: text retrieval systems, information extraction systems, automatic summarizations, etc.
§ Part I (5%): A proposal describing your project, including systems functions and tools that you will use to develop the system. Please use flow charts to demonstrate your idea.
§ Part II (25%): A. System development using the programming language that your group is most familiar with. (You will get all the necessary IR concepts and theories from lectures. The instructor will not spend time discussing the implementation. For example, the instructor will discuss what an inverted file is and how it can be used for automatic indexing and retrieval, but not how to use Java or C++ to generate the inverted file.) B. System design and documentation (including flow charts and user manual), and the evaluation of system performance.
§ Part III (5%): Presentation
5. Final Exam (30%):
Resources
1. from The Information Retrieval Group at the University of Glasgow:
§ A list of stop words: http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words
§ Test Collections: http://www.dcs.gla.ac.uk/idom/ir_resources/test_collections/
2. from McCallum, Andrew Kachites, Computer Science Dept, Carnegie Mellon University
§ "Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering," 1996. http://www.cs.cmu.edu/~mccallum/bow.
3. from Neural Networks Research Centre, Helsinki University of Technology
§ SOM_PAK
§ Nenet
4. from AI Lab, University of Arizona
Schedule (subject to change, last updated August/27/2004)
Week | Topic | Readings | Due Dates |
1 09/01 | Course Logistics and Overview; IR Academic Resources; Document and Query Forms | "What do people want from IR" by Croft; Korfhage Ch1-2; Van Rijsbergen Ch1-2 | |
2 09/08 | Data Compression; Query Structures | Korfhage Ch2-3 | |
3 09/15 | Matching Process; Text Analysis | Korfhage Ch 3-5; Rijsbergen Ch2 (the Zipf's Law part only) | On-line search log due (Sept 18) |
4 09/22 | Basics of UNIX; Experiencing an IR System; Retrieval Experiment I Instructions | Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering | Finish reading Dr. Chen's book "Trailblazing a Path Towards Knowledge and Transformation" |
5 09/29 | Document Similarity; Document and Concept Classification/Clustering Techniques; Overview of Final Projects | Rijsbergen Ch3; On the Automatic Classification of Accounting concepts: Preliminary Results of the Statistical Analysis of Term-Document Frequencies, by Gangolly and Wu, published in New Review of Applied Expert Systems and Emerging Technologies, pp 81-88, v 6, 2000. | IR system design competition due Oct/02. |
6 10/06 | Text Mining using Neural Networks; SOM and Nenet package overview; Retrieval Experiment II Instructions | AI LAB: A Scalable Self-Organizing Map Algorithm for Textual Classification: A Neural Network Approach to Automatic Thesaurus Generation (Roussinov & Chen, 1998) | Both: 1. Retrieval Exp I due, and 2. Votes for best IR system design due (Oct 09) |
7 10/13 | Document Warehousing | Sullivan Ch1 | |
8 10/20 | Information Extraction; Text Mining Applications | 1. Sullivan Ch13; 2. Information Extraction: Techniques and Challenges, by Ralph Grishman; 4. Combining Data and Text Mining Techniques For Analyzing Financial Reports | Semester Project Proposal due (Oct 16) |
9 10/27 | Retrieval Effectiveness Measures; Output Presentation | Korfhage Ch 8, 11; Rijsbergen Ch 7 | Retrieval Exp II due (Oct 30); Finish reading Dr. Chen's book "Knowledge Management Systems: A Text Mining Perspective" |
10 11/03 | IR Effectiveness Improvement Techniques: Relevance Feedback, Query Expansion, Local Context Analysis, and Word-Sense Disambiguation | Korfhage Ch 9 | |
11 11/10 | User Profiles; Alternative Retrieval Techniques | Korfhage Ch 6, 10; S. Chakrabarti, B. Dom and P. Indyk. Enhanced hypertext categorization using hyperlinks. Proceedings of ACM SIGMOD 1998. http://www.cs.toronto.edu/~wtjioe/mining/hypertext.pdf | |
12 11/17 | Natural Language Processing | Little Words Can Make a Big Difference for Text Classification by Ellen Riloff | |
13 11/24 | Thanksgiving. No Class. | | |
14 12/01 | Final Exam | covering all previous lectures and the 2 booklets by Dr. HsinChun Chen | |
15 12/08 | Semester Project Presentation | | Final Project due midnight Dec/07 |