CIS 634 INFORMATION RETRIEVAL
Syllabus for Distance Learning Section
Fall 2003
(subject to change)
Instructor: Yi-Fang Wu, Ph.D.
Class Web Board: http://webboard.njit.edu:8080/~F2003CIS634-851
Contact the Instructor:
· In person: GITC 5502. Hours: Monday 11:30-13:00 and Friday 16:30-17:45.
· By e-mail (fastest way to get a response): wu@njit.edu
· By Phone: (973) 596 - 5285
Outdated Catalog Description (from NJIT web site)
Prerequisites: CIS 631. Covers the concepts and principles of information retrieval systems design. Techniques essential for building text databases, document processing systems, office automation systems, and other advanced information management systems.
Introduction
Information retrieval (IR) is a fast-changing field concerned with the representation, organization, storage, and retrieval of information items. Under a broader definition, information items include text, numbers, multimedia, and more; under a narrower definition, they include only text. Because other courses cover other types of retrieval, this course will focus mostly on text retrieval. Even so, there are still many topics to be covered.
The importance of text retrieval is obvious: most business data is in text format. However, most text is not as well organized as the numerical data stored in commercial databases. This, along with linguistic complexity, has kept the performance of text retrieval low. To achieve high retrieval effectiveness, techniques such as automatic indexing, query expansion, local context analysis, information extraction, text mining, and many more have been developed. As an information professional, you should know how to use these techniques to organize, store, and retrieve text effectively and efficiently.
This course is designed to address both the theory and the practice of IR. It consists of two parts: an introduction to IR theories, and hands-on experience of retrieval using a readily available system. New topics added for Fall 2003: document warehousing and text mining.
Please note: office automation systems and advanced information management systems listed in the course catalog for CIS 634 are not covered in this course.
Class Conduct
This is a graduate course. As an NJIT graduate student, you must follow the Institute's academic rules. Please refer to the student handbook for details. Specifically:
§ Academic dishonesty is not allowed and will be reported to the Dean of Student Services.
§ All written assignments will be sent to www.turnitin.com, a plagiarism prevention system, for verification.
§ Late assignments will be penalized 25% a day.
§ You are required to check webboard announcements at least 3 times a week.
§ When posting messages on the class web board, respect others and be considerate.
§ If you have to drop the class after you are assigned to a project group, please notify the instructor and your group members immediately.
§ Snow closing information will be available on the NJIT web site. The instructor does not make the decision.
Textbooks
Required 1: Information Storage and Retrieval, by Robert R. Korfhage, Publisher: John Wiley & Sons (ISBN: 0471143383)
Required 2: Trailblazing a Path Towards Knowledge and Transformation, by HsinChun Chen. It is available in its entirety at the author's web site: http://ai.bpa.arizona.edu/go/download/Chen2Book.pdf (a small booklet containing only 80 5"x7" pages). Please finish reading it by the end of the 4th week.
Required 3: Knowledge Management Systems: A Text Mining Perspective, by HsinChun Chen. It is available in its entirety at the author's web site: http://ai.bpa.arizona.edu/go/download/chenKMSi.pdf (a small booklet containing only 50 5"x7" pages). Please finish reading it by the end of the 9th week.
Optional 1: Document Warehousing and Text Mining: Techniques for Improving Business Operations, Marketing, and Sales, by Dan Sullivan, Publisher: Wiley, 2001 (ISBN: 0471399590)
Optional 2: Information Retrieval (2nd Ed), by C. J. van Rijsbergen, Publisher: London: Butterworths, 1979. It is available in its entirety at the author's web site: http://www.dcs.gla.ac.uk/Keith/Preface.html
Supplemental readings are available through the links on the course schedule below.
Assignments, Grading, and Due Dates (subject to change):
Note: All written assignments should be word-processed and posted on web board as WORD or PDF attachments. Do not submit hard-copies. The group presentation slides should be posted as PowerPoint attachments. Remember to specify your name on it, but do not list your social security number!! Most importantly, all written assignments should follow proper citation style!
1. Participation (attendance, self-intro, group competition, and webboard contributions) 10%
Self-introduction and reply to one other student's intro (1%) due midnight, Sept 6.
IR system design competition (4%): design due midnight Oct 04 and votes are due midnight Oct 11 (can't vote for yourself).
Webboard participation (5%)
2. An Online Search Log 10%
due: midnight Sept 20.
3. Two Retrieval Experiments 20 %
RE I (10%): due midnight Oct 11
RE II (10%): due midnight Oct 25
4. Semester Project 30%
Proposal (3%): due midnight October 18.
Full paper (22%) or (for programming projects: the developed system 17% and documentation 5%): due midnight Dec/04
Presentation (5%):
Questions (2%) Each group, after reviewing other groups' work, has to ask 1 question for each of the remaining groups by replying to their final project webboard message. (Meaning if there are 10 groups, your group has to generate 9 questions.) The quality of your questions is important. Due: midnight December 10.
Answers (3%) Groups have less than 1 week to respond to questions. The quality of your responses is important. Due: midnight December 13.
5. Final Exam 30%
Take home exam.
Total 100%
Following is an overview of the assignments and projects; more details will be provided in class.
1. Participation (10%):
Self-introduction on the web board: Please follow the instructions in the "Introductions" conference on the class webboard. Reply to and get to know at least one of your classmates. (1%)
IR system design competition.(4%)
Webboard participation: The instructor will post topics for on-line discussions on the web board. Please respond to them. You are also welcome to offer your points of view on other students' questions. Respect others and be considerate when responding to postings! "Good participation" means positive and constructive postings, not a large number of postings. (5%)
2. An Online Search Log (10%):
§ Part I: Search and record any useful information for the task assigned. There are no limitations on the number of resources you can use, as long as they are web search engines, web directories, or electronic journal databases available at the NJIT library web site. (The instructor might not have access to other resources.)
Task: Please find necessary information on "how to setup a small network for three desktops running Windows XP at home?" (Note: 1. The sentence describing the task should not be the query you use to search for information. You should come up with your own queries. Through this experience, you would learn why it is difficult for users to find information. 2. Even if you know the answer without the need to search for information, please pretend you don't know the answer and go on with the assignment.)
For each search session, be sure to record the following items along the search process:
1. The search engine/directory/electronic database you used
2. The search query you entered. (If you entered several queries with the same search tool, treat them as different sessions.)
3. Number of search hits returned.
4. In the top twenty returned documents, find out the number of documents/hits actually relevant to your query and their URLs (or document titles, if an electronic database is used).
5. Among all returned hits, what is the number of returned documents you browsed through?
Note:
§ If you use more than one query for a particular search engine, please record items 2, 3, 4, and 5 for each query.
§ If you do not find useful information and decide to use other resources, please repeat the above steps.
§ Clearly mark the URL or title of the best site/best paper obtained from your search.
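The items recorded for each session are enough to compute a simple effectiveness figure. The sketch below is one possible way to do so and is not part of the assignment instructions: it computes precision over the top twenty hits from items 3 and 4. The class name, field names, and the choice to divide by the number of hits actually available when fewer than twenty are returned are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class SearchSession:
    """One logged search session (field names are hypothetical)."""
    tool: str                # item 1: search engine/directory/database used
    query: str               # item 2: the query entered
    total_hits: int          # item 3: number of hits returned
    relevant_in_top_20: int  # item 4: relevant documents among the top twenty
    browsed: int             # item 5: number of returned documents browsed


def precision_at_20(session: SearchSession, cutoff: int = 20) -> float:
    """Fraction of the examined top hits that were judged relevant.

    Assumption: if fewer than `cutoff` hits were returned, divide by the
    number of hits actually available rather than by `cutoff`.
    """
    examined = min(cutoff, session.total_hits)
    if examined == 0:
        return 0.0
    return session.relevant_in_top_20 / examined
```

For example, a session that returned 1,500 hits with 7 relevant documents in the top twenty has a precision of 7/20 at that cutoff. Comparing this figure across queries and search tools is one way to see which sessions were most productive.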
§ Part II: Read "What Do People Want from IR" by Croft. Based on your experience as a user of web search engines and/or text databases, write a
3. Retrieval Experiments (20%):
Assigned queries for your RE I and II
(will be filled in soon after the withdrawal deadline.)
Q1 | Ahmad |
Q2 | Bushell |
Q3 | Chaar |
Q4 | Conover |
Q5 | The Instructor |
Q6 | Dougan |
Q7 | Hutchinson |
Q8 | Kadzielawa |
Q9 | Kamat |
Q10 | Karandikar |
Q11 | Kong |
Q12 | Moeller |
Q13 | Mount |
Q14 | Ojiem |
Q15 | Ruymann |
Q16 | Sequeira |
Q17 | Shah |
Q18 | Sharma |
Q19 | Tong |
Q20 | Tsai |
Q24 | Williams, J |
Q25 | Williams, M |
Q26 | Xie |
Q27 | |
Q28 | |
Q29 | |
Q30 | |
Q31 | |
Q32 | |
Q33 | |
Q34 | |
Q35 | |
§ RE I (10%): You are required to work with an IR system, a test collection, and a query, and then run several retrieval experiments using Arrow from the Bow toolkit (see the Resources section of the syllabus).
§ RE II (10%): You are required to create a small document collection based on the results from your RE I, build a document-term matrix, and perform document classification/clustering analysis using Rainbow and Nenet (see Resources below). The
4. Semester Group Project (30%): Two options: case analysis project or programming project. Instructions here.
Option 1: Case analysis:
You are required to make up a fictional client and its business problems.
§ Part I (3%): A proposal defining your client, its business problems, and at least 3 possible sources of documents.
§ Part II (22%): Use text mining software programs to collect, pre-process, index, mine, and analyze the documents you collected. Deliverables: 1. A CD containing all documents you collected; 2. PowerPoint slides; 3. A final report containing a. a 1-page executive summary, b. main report: tools used and screen shots of outputs, c. analysis and recommended solutions to the client's business problems.
§ Part III (5%): Presentation
Option 2: Programming Project
You can design your own project with the instructor's approval. Sample choices are: text retrieval systems, information extraction systems, automatic summarizations, etc.
§ Part I (3%): A proposal describing your project, including system functions and the tools that you will use to develop the system. Please use flow charts to demonstrate your idea.
§ Part II (22%): A. System development using the programming language that your group is most familiar with. (You will have all the necessary IR concepts and theories from lectures. The instructor will not spend time discussing the implementation. For example, the instructor will discuss what an inverted file is and how it can be used for automatic indexing and retrieval, but not how to use Java or C++ to generate the inverted file.) B. System design and documentation (including flow charts and user manual), and the evaluation of system performance.
§ Part III (5%): Presentation
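As a rough orientation for groups choosing the programming option, the inverted file mentioned in Part II can be sketched in a few lines. This is only an illustrative sketch under simplifying assumptions, not the instructor's method or a required design: terms are produced by lowercasing and splitting on whitespace, no stop words are removed, and the function names build_inverted_file and search_all_terms are hypothetical.

```python
from collections import defaultdict


def build_inverted_file(documents):
    """Map each term to the set of document ids that contain it.

    `documents` is a dict of {doc_id: text}. Tokenization here is a
    simplifying assumption: lowercase, then split on whitespace.
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search_all_terms(index, query):
    """Return ids of documents containing every term in the query (Boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Start from a copy of the first term's postings, then intersect the rest.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

Indexing happens once, when the collection is read; each query then only touches the postings of its own terms instead of scanning every document. A real project would add steps such as stop word removal, stemming, and ranked output.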
5. Take Home Exam (30%):
Resources
1. from The Information Retrieval Group at the University of Glasgow:
§ A list of stop words: http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words
§ Test Collections: http://www.dcs.gla.ac.uk/idom/ir_resources/test_collections/
2. from McCallum, Andrew Kachites, Computer Science Dept, Carnegie Mellon University
§ "Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering," 1996. http://www.cs.cmu.edu/~mccallum/bow.
3. from Neural Networks Research Centre, HELSINKI UNIVERSITY OF TECHNOLOGY
§ SOM_PAK
§ Nenet
4. from AI Lab, University of Arizona
Schedule
(subject to change, last updated Aug/10/2003)
Week | Topic | Readings | Due Dates |
1 | Course Logistics and Overview; IR Academic Resources; Document and Query Forms | "What do people want from IR" by Croft; Korfhage Ch 1-2; Van Rijsbergen Ch 1-2 | Self-introduction on webboard due (Sept 6) |
2 | Data Compression; Query Structures | Korfhage Ch 2-3 | |
3 | Matching Process; Text Analysis | Korfhage Ch 3-5; Van Rijsbergen Ch 2 (the Zipf's Law part only) | On-line search log due (Sept 20) |
4 | Basics of UNIX; Experiencing an IR System; Retrieval Experiment I Instructions | Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering | Finish reading Dr. Chen's book "Trailblazing a Path Towards Knowledge and Transformation" |
5 | Document Similarity; Document and Concept Classification/Clustering Techniques | Van Rijsbergen Ch 3; On the Automatic Classification of Accounting Concepts: Preliminary Results of the Statistical Analysis of Term-Document Frequencies, by Gangolly and Wu, New Review of Applied Expert Systems and Emerging Technologies, v 6, pp 81-88, 2000 | Design competition: Part I System Designs due (Oct 04) |
6 | Text Mining using Neural Networks; SOM and Nenet package overview; Retrieval Experiment II Instructions | AI LAB: A Scalable Self-Organizing Map Algorithm for Textual Classification: A Neural Network Approach to Automatic Thesaurus Generation (Roussinov & Chen, 1998) | Design competition: Part II Votes due (Oct 11); Retrieval Exp I due (Oct 11) |
7 | Document Warehousing | Sullivan Ch 1 | Semester Project Proposal due (Oct 18) |
8 | Information Extraction; Text Mining Applications | Sullivan Ch 13; 1. Information Extraction: Techniques and Challenges, by Ralph Grishman; 3. Combining Data and Text Mining Techniques For Analyzing Financial Reports | Retrieval Exp II due (Oct 25) |
9 | Retrieval Effectiveness Measures; Output Presentation | Korfhage Ch 8, 11; Van Rijsbergen Ch 7 | Finish reading Dr. Chen's book "Knowledge Management Systems: A Text Mining Perspective" |
10 | IR Effectiveness Improvement Techniques: Relevance Feedback, Query Expansion, Local Context Analysis, and Word-Sense Disambiguation | Korfhage Ch 9 | |
11 | User Profiles; Alternative Retrieval Techniques | Korfhage Ch 6, 10; S. Chakrabarti, B. Dom and P. Indyk. Enhanced hypertext categorization using hyperlinks. Proceedings of ACM SIGMOD 1998. http://www.cs.toronto.edu/~wtjioe/mining/hypertext.pdf | |
12 | Final Exam | Covering all previous lectures and the 2 booklets by Dr. HsinChun Chen. The exam will be e-mailed to you and also posted on the webboard by 5pm Nov/21. It's due back via e-mail to Prof Wu by 12pm Nov/22 or be postmarked on Nov/22. | |
13 | Natural Language Processing | Little Words Can Make a Big Difference for Text Classification, by Ellen Riloff | |
14 | Semester Project Presentation Part I: Questions | Each group, after reviewing other groups' work, has to ask 1 question for each of the remaining groups by replying to other groups' final project webboard messages. (Meaning if there are 7 groups, your group has to generate 6 questions.) The questions have to be very carefully chosen so that they point out the problems of another group or things that are not clear to you. | All Semester Projects and Presentation Slides DEADLINE: midnight Dec 4. Questions for other groups are due: Dec 10 |
15 | Semester Project Presentation Part II: Answers | All groups have to answer the questions assigned to them by Dec/13 (Sat). However, you should answer them as soon as they are posted. Divide work among your group members. The quality of your responses is important. | Answers are due Dec/13. |