|
Note:
- The classroom has changed. The class now meets in Cullimore, Lecture Hall
3, on Thursdays, 6-7:30 pm.
- The course has changed back to hybrid mode. Please purchase the lecture CDs
from the NJIT bookstore or borrow them from the NJIT library. Follow the
syllabus to find the specific class materials for each week, and listen to
the CDs before you come to class: in class we will only go over the
important materials and conduct class activities.
- Please check the official syllabus posted on Prof. Wu's web site
regularly for the most up-to-date information.
Instructor
Contact the Instructor:
Class WebCT
Introduction
Class Conduct and Attendance
Textbooks
Assignments, Grading, and Due Dates
§ Participation (10%)
§ Online Search Logs (5%)
§ Retrieval Experiments (20%)
§ Semester Group Project (35%)
§ Final Exam (30%)
Resources
Schedule
|
|
 |
Instructor |
| |
Instructor and coordinator of this course: Professor
Yi-Fang Wu, Ph.D. (a.k.a. Brook Wu)
|
| |
|
 |
Contact the Instructor: |
| |
Dr. Wu:
- In person: GITC 4104. Hours: Monday 4:15-6:30 pm and Thursday
3:30-5:30 pm (right before class).
- Online: Yahoo! Messenger ID: profbrookwu. Hours: same as in-person
office hours. (Do NOT send offline messages. Please use e-mail when I am
not on Yahoo! Messenger.)
- By e-mail (fastest way to get a response): wu@njit.edu
- By phone (during in-person office hours; please do not leave voice messages):
(973) 596-5285
|
| |
|
 |
Class WebCT |
| |
Go to http://webct.njit.edu and log in with your UCID. You should see IS 634 on
the list of WebCT courses you have access to. (Our WebCT should be ready by
Jan/15.) |
| |
|
 |
Introduction |
| |
Information retrieval (IR) is a fast-changing field concerned with the representation,
organization, storage, and retrieval of information items. A broader
definition of information items covers text, numbers, multimedia, and more,
while a narrower definition covers only text. Because other courses are
offered on other types of retrieval, this course focuses mostly on text retrieval.
The importance of text retrieval is clear: most business data is in
textual format. However, most text is not as well organized as the
numerical data stored in commercial databases. This, along with
linguistic complexity, makes text retrieval difficult.
To achieve high retrieval effectiveness, techniques such as automatic
indexing, query expansion, local context analysis, information extraction,
text mining, and many more have been developed. As an information professional,
you should know how to use these techniques to organize, store,
and retrieve text effectively and efficiently. This course is designed
to address both the theory and the practice of IR.
Prerequisite for IS 634: IS 631
Prerequisite for IS 392: Math 333 |
| |
|
 |
Class Conduct and Attendance: |
| |
This is a graduate course. As an NJIT graduate student, you must follow
the Institute's honor code. Please refer to the student handbook for
details. Specifically:
- Academic dishonesty is not allowed and will be reported to the Dean of Student Services.
- All written assignments will be sent to www.turnitin.com, a plagiarism prevention system, for verification.
- Late assignments will be penalized 25% per day. No late submissions
will be accepted for the final semester project or other major
deliverables.
- You are required to check WebCT announcements at least 3 times a week.
- When posting messages on the class web board, respect others and be considerate.
- You are expected to attend all class meetings ON TIME. Attendance
will be recorded at every meeting. Poor attendance will negatively
affect your semester grade.
- If you have to drop the class after you are assigned to a project group, please notify the instructor and your group members immediately.
- Snow closing information will be available on the NJIT web
site. The instructor does not make the decision.
|
| |
|
 |
Textbooks |
| |
- Required 1: Information Storage and Retrieval, by Robert R. Korfhage, Publisher: John Wiley & Sons (ISBN: 0471143383)
- Required 2: Trailblazing a Path Towards Knowledge and Transformation, by HsinChun Chen. It
is available in its entirety at the author's web site: http://ai.bpa.arizona.edu/go/download/Chen2Book.pdf (a
small booklet containing only 80 5"x7" pages).
- Required 3: Knowledge Management Systems: A Text Mining Perspective, by HsinChun Chen.
It is available in its entirety at the author's web site: http://ai.bpa.arizona.edu/go/download/chenKMSi.pdf (a
small booklet containing only 50 5"x7" pages).
- Optional 1: Document Warehousing and Text Mining: Techniques for Improving Business Operations, Marketing, and Sales, by Dan Sullivan, Publisher: Wiley, 2001 (ISBN: 0471399590)
- Optional 2: Information Retrieval (2nd Ed), by C. J. van Rijsbergen, Publisher: London: Butterworths, 1979. It is available in its entirety at the author's web site: http://www.dcs.gla.ac.uk/Keith/Preface.html
Supplemental readings are available through the links on the course schedule below. |
| |
|
 |
Assignments and Grading (subject to change): |
| |
Note: All written assignments should be word-processed and posted on
the web board. Do not submit hard copies.
The group presentation slides should be posted as PowerPoint attachments.
Remember to put your name on your assignments, but do not list your
social security number!
Last but not least, all written assignments should include proper
citations!
- Participation 10%
  - Self-introduction and reply to one other student's intro (1%)
  - IR system design competition (6%)
  - WebCT and in-class participation (3%)
- Assignment 5%
  - An Online Search Log (5%)
- Two Retrieval Experiments 20%
- Semester Project 35%
  - Proposal (5%)
  - Project and presentation (30%)
- Final Exam 30%
- Total 100%
|
| |
|
The following describes the assignments and projects.
|
| |
|
| |
- Self-introduction on the web board (1%): Please introduce yourself
in the "Introductions" conference on the class WebCT. Reply to, and get
to know, at least one classmate.
- IR design competition (6%): Each student works individually to design
an IR system using flowcharts. The system should have two main
components: automatic indexing and retrieval. For automatic
indexing, include the 3 components in Lecture 3, Part 2, slide 3.
For retrieval, use Boolean retrieval with only the AND
and OR operators. Use Visio or the drawing function
in Word to draw the flowcharts.
- In-class and/or WebCT participation (3%): The instructor
will post discussion topics on the web board. Please respond
to them. You are also welcome to answer other students' questions.
Respect others and be considerate when you respond to postings!
Good participation is measured by constructive postings,
not by the number of postings.
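As a rough illustration of the two components named in the IR design competition, the sketch below builds a tiny inverted index and answers AND and OR queries against it. The indexing steps shown here (lowercasing, tokenizing, dropping stopwords), the stopword list, the sample documents, and all function names are assumptions made for illustration only; they are not the three components from the lecture slides, and the competition itself asks for flowcharts, not code.

```python
import re
from collections import defaultdict

# Hypothetical stopword list, chosen only for this illustration.
STOPWORDS = {"a", "an", "and", "is", "of", "or", "the", "to"}


def tokenize(text):
    """Lowercase the text, keep runs of letters, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def build_index(docs):
    """Automatic indexing: map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def boolean_and(index, terms):
    """Retrieval: documents that contain every query term."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()


def boolean_or(index, terms):
    """Retrieval: documents that contain at least one query term."""
    postings = [index.get(t, set()) for t in terms]
    return set.union(*postings) if postings else set()


# Made-up sample collection.
DOCS = {
    1: "Text mining and retrieval",
    2: "Automatic indexing of text",
    3: "Boolean retrieval",
}
INDEX = build_index(DOCS)

print(boolean_and(INDEX, ["text", "retrieval"]))   # documents with both terms
print(boolean_or(INDEX, ["indexing", "boolean"]))  # documents with either term
```

Storing each term's postings as a set keeps the example short, since AND maps to set intersection and OR to set union.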
|
| |
§ An Online Search Log (5%): |
| |
This assignment appears to be easy and straightforward. However,
if you carefully record the whole procedure, you will run into many IR
problems during the exercise. The instructor will show you how
this exercise relates to other IR topics later in the semester.
- Part I: Search for and record any useful information for the assigned task
(see below). Use as many resources as you can, including
web search engines, web directories, and the electronic journal databases
available on the NJIT library web site. (The instructor might not have
access to other resources.)
Task: Please find the necessary information on "How
can text mining contribute to competitive advantage?" (Note:
1. The sentence describing the task should not be the only
query you use to search for information.
You should come up with your own queries. Through this experience,
you will learn why it is difficult for users to find information.
2. Even if you already know the answer, please pretend you
don't and carry on with the assignment.)
For each search session, be sure to record the following
items during the search process:
1. The search engine, directory, or electronic database you used.
2. The search query you entered. (If you entered several queries into
one search tool, treat them as different sessions.)
3. The total number of search hits returned.
4. Among the top twenty returned documents, the number of documents/hits
actually relevant to your query, and their URLs (or document
titles, if an electronic database is used).
5. Among all returned hits, the number of documents
you browsed through.
Note:
- If you use more than one query for a particular search engine, please
record items 2, 3, 4, and 5 for each query.
- If you do not find useful information and decide to use other resources,
please repeat the above steps.
- Clearly mark the URL or title of the best site/best paper obtained
from your search.
- Part II: Read "What Do People Want from IR" by Croft. Based on your
experience as a web search engine and/or text database user, write a short
paper (2 single-spaced pages) called "What Do I Need from IR?" List
at least 3 items. You do not need to pick
items from Croft's paper for discussion. For each item, please
provide examples of frustrations that you experienced in
the first part of this assignment.
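One possible way to keep the search log consistent is to store each session as a small record holding the items listed in Part I. The sketch below does that and also computes the fraction of the top twenty hits judged relevant. The class name, field names, and sample values are all hypothetical; the assignment does not prescribe any particular format or require this calculation.

```python
from dataclasses import dataclass, field


@dataclass
class SearchSession:
    """One search session: one query entered into one search tool."""
    tool: str                  # search engine, directory, or database used
    query: str                 # the query exactly as entered
    total_hits: int            # total number of hits returned
    relevant_in_top20: list = field(default_factory=list)  # URLs or titles
    browsed: int = 0           # number of returned documents browsed

    def precision_at_20(self):
        """Fraction of the examined top hits (at most 20) judged relevant."""
        examined = min(20, self.total_hits)
        return len(self.relevant_in_top20) / examined if examined else 0.0


# Made-up example values, not real search results.
session = SearchSession(
    tool="example search engine",
    query="text mining competitive advantage",
    total_hits=1500,
    relevant_in_top20=["url-1", "url-2", "url-3", "url-4", "url-5"],
    browsed=12,
)
print(session.precision_at_20())
```

Recording one object per query makes it easy to compare sessions across tools when writing Part II.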
|
| |
§ Retrieval Experiments (20%): |
| |
Note that each student will work individually and will use a unique
query; therefore, no two students' results will
be identical. Post the results of both retrieval experiments
on your own web site. Note: To gain access to the test collections
and the IR system, every student needs an AFS account. Please follow
the instructions at http://newaccount.njit.edu/.
Please direct
all your questions to the Computing Help Desk at 973-596-2900.
Assigned queries for your RE I and II
(The table will be filled out soon after the withdrawal deadline.
Please refer to LISA.QUE to find the exact query.)
| Q1 | Bekele, Mahlet |
| Q2 | Fiagbe, Peter |
| Q3 | Idris, Yomi |
| Q4 | Kahlon, Biplavjit |
| Q5 | Professor Wu |
| Q6 | Kobylinski, Mirko |
| Q7 | Krikun, Anastasiya |
| Q8 | Li, Xugong |
| Q9 | Monroy, Alain |
| Q10 | Nagdev, Umesh |
| Q11 | Neyra, Wilson |
| Q12 | Pax, Christopher |
| Q13 | Schuler, Richard |
| Q14 | Simpson, Melford |
| Q15 | Srinivasan, Anand |
| Q16 | Tao, Jyun-Ze |
| Q17 | Terranova, Joseph |
| Q18 | Thiaw, Lamine |
| Q19 | Vankawala, Maulikkumar |
| Q20 | Wiggins, Kelly |
| Q24 | Wong, Chi |
| Q25 | |
| Q26 | |
| Q27 | |
| Q28 | |
| Q29 | |
| Q30 | |
| Q31 | |
| Q32 | |
| Q33 | |
| Q34 | |
| Q35 | |
| |
|
- RE I (10%): You are required to use an IR system, a test
collection, and a query to run several retrieval experiments
using Arrow from the Bow toolkit (see the Resources section of the syllabus
below). Instructions are in the Lecture 4-2 slides. To present the
results of your Retrieval Experiment I, please use this template: http://web.njit.edu/~wu/CIS634/CIS634re1.html
- RE II (10%): You are required to create a small document collection
and a document-term matrix based on the results of your RE I, and
perform document classification/clustering analysis using Rainbow
and Nenet (see Resources below). Instructions are in the Lecture 6-2 slides. To
present the results of your Retrieval Experiment II, please use
this template: http://web.njit.edu/~wu/CIS634/CIS634re2.html
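For readers unfamiliar with the document-term matrix mentioned in RE II, the sketch below shows the basic idea with raw term counts in plain Python. It is only a conceptual illustration: the whitespace tokenization, the function name, and the sample documents are assumptions, and the output is not in the input format that Rainbow or Nenet expects (follow the lecture slides for that).

```python
from collections import Counter


def document_term_matrix(docs):
    """Return (vocabulary, matrix), where matrix[i][j] is the number of
    times vocabulary[j] occurs in docs[i]."""
    # Naive tokenization (lowercase, split on whitespace), for illustration only.
    counts = [Counter(text.lower().split()) for text in docs]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix


# Made-up two-document collection.
vocabulary, matrix = document_term_matrix(["text mining", "text retrieval text"])
print(vocabulary)
for row in matrix:
    print(row)
```

Each row is one document and each column one term, so documents with similar rows use similar vocabulary, which is the property that classification and clustering methods rely on.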
|
| |
§ Semester Group Project (35%): |
| |
Two options: case analysis project or programming project. Instructions here.
- Option 1: Case analysis
You are required to make up a fictional client and its business problems.
- Part I (5%): A proposal defining your client, its business problems,
its business environment (competitors and competing products,
if any), and at least 2 possible sources of documents for
text analysis. Please discuss the sources of documents: Why
did you choose them? Do they have enough documents for this
project? How are the documents generated (e.g., user reviews
posted by real customers, or expert reviews)? Please
provide as much detail as possible.
- Part II (25%): Use text mining software to collect,
pre-process, index, mine, and analyze the documents you
collected. Deliverables: 1. All documents you collected;
2. PowerPoint slides for the presentation;
3. A final report containing a. a 1-page executive summary,
b. the main report: tools used and screenshots of outputs, c.
an analysis of results, and d. recommended
solutions to the client's business problems.
- Part III: Presentation (5%)
- How to prepare deliverables? Please click here.
Good case analysis projects by former students
- Option 2: Programming project
You can design your own project with the instructor's
approval. Sample choices are text retrieval systems, information
extraction systems, automatic summarization systems, etc.
- Part I (5%): A proposal describing your project, including
system functions and the tools (programming languages, databases,
etc.) that you will use to develop the system. Most importantly, please
use flowcharts to show how your system works. Please
provide as much detail as possible.
- Part II (30%):
- A. System development using whichever programming language
your group is most familiar with. (You will get all the
necessary IR concepts and theories from the lectures. The instructor
will not spend time discussing implementation. For example, the instructor
will discuss what an inverted file is and how it can be used
for automatic indexing and retrieval, but not how to use Java or
C++ to write a program that generates it.)
- B. System design and documentation (including flowcharts, a user manual
with screenshots, and an evaluation of system performance).
- Part III: Presentation (5%)
- How to prepare deliverables? Please click here.
|
| |
|
| |
The final exam covers all previous lectures and the two booklets written by Dr. HsinChun Chen. |
| |
|
 |
Resources |
| |
1. from the Information Retrieval Group at the University of Glasgow:
2. from McCallum, Andrew Kachites, Computer Science Dept, Carnegie Mellon University
3. from Neural Networks Research Centre, HELSINKI UNIVERSITY OF TECHNOLOGY
4. from AI Lab, University of Arizona
|
| |
|
| § Schedule (subject to change, last updated January/07/2008) § |
| Week | Topic | Readings | Due Dates |
| | Course Logistics and Overview; IR Academic Resources | Slides: 1-1 | |
| | IR Overview | "What Do People Want from IR" by Croft; Slides: 1-2 | Self-introduction on WebCT due Feb/03. |
| | Document and Query Forms | Korfhage Ch 1-2; Van Rijsbergen Ch 1-2; Slides: 2-1, 2-2 | Online search log due Feb/10. |
| | Data Compression; Query Structures | Korfhage Ch 2-3; Slides: 2-2 (cont.), 3-1 | |
| | Matching Process; Text Analysis | Korfhage Ch 3-5; Van Rijsbergen Ch 2 (the Zipf's Law part only); Slides: 3-2 | IR competition in class |
| | TF.IDF; Basics of UNIX; Experiencing an IR System; Retrieval Experiment I Instructions | Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering; Slides: 3-2 (cont.), 4-1, 4-2 | Finish reading Dr. Chen's book "Trailblazing a Path Towards Knowledge and Transformation" |
| | Overview of Final Projects; Document Similarity; Document and Concept Classification/Clustering Techniques; Text Mining using Neural Networks | Van Rijsbergen Ch 3; Slides: 5-1, 5-2 (quickly skim through), 6-1 | Retrieval Exp I due 03/09 |
| | SOM and Nenet package overview (Retrieval Experiment II Instructions); Document Warehousing | AI LAB: A Scalable Self-Organizing Map Algorithm for Textual Classification: A Neural Network Approach to Automatic Thesaurus Generation (Roussinov & Chen, 1998); Sullivan Ch 1; Slides: 6-2, 7-1 | Semester Project Proposal due 03/16 |
| Spring Break 03/17-23 |
| | Information Extraction; Text Mining Applications | 1. Sullivan Ch 13; 2. Information Extraction: Techniques and Challenges, by Ralph Grishman; 3. An interactive system for finding complementary literatures: a stimulus to scientific discovery, Artificial Intelligence, Volume 91, Issue 2, April 1997, Pages 183-203, Don R. Swanson and Neil R. Smalheiser; 4. Combining Data and Text Mining Techniques For Analyzing Financial Reports; Slides: 8-1, 8-2 | |
| | Retrieval Effectiveness Measures; Output Presentation | Korfhage Ch 8, 11; Van Rijsbergen Ch 7; Slides: 9-1, 9-2 | Retrieval Exp II due 04/06 |
| | IR Effectiveness Improvement Techniques: Relevance Feedback, Query Expansion, Local Context Analysis, and Word-Sense Disambiguation | Korfhage Ch 9; Slides: 10-1 | Finish reading Dr. Chen's book "Knowledge Management Systems: A Text Mining Perspective" |
| | User Profiles; Alternative Retrieval Techniques | Korfhage Ch 6, 10; S. Chakrabarti, B. Dom and P. Indyk, Enhanced hypertext categorization using hyperlinks, Proceedings of ACM SIGMOD 1998; J. Kleinberg, Authoritative sources in a hyperlinked environment; Slides: 11-1, 11-2 | |
| | Natural Language Processing | Little Words Can Make a Big Difference for Text Classification, by Ellen Riloff; Slides: 12 | |
| | Final Project Week | | Final Project due 04/30 |
| | Final Exam | Covers all previous lectures and the 2 booklets by Dr. HsinChun Chen. | TBA |
|
|