Instructor
Contact the Instructors:
Class Moodle
Introduction
Class Conduct and Attendance
Textbooks
Assignments, Grading, and Due Dates
§ Participation (10%)
§ Online Search Logs (5%)
§ Retrieval Experiments (20%)
§ Semester Group Project (35%)
§ Final Exam (30%)
Resources
Schedule
Instructor
Instructor and coordinator of this course: Professor Yi-Fang Wu, Ph.D. (a.k.a. Brook Wu)
Contact the Instructor:
Dr. Wu:
- In person: GITC 4104. Hours: Monday and Wednesday, 2:30-4 pm
- Online: Yahoo! Messenger, ID: profbrookwu. Hours: Monday and Wednesday, 2:30-4 pm. (Do NOT send offline messages. Please use e-mail when I am not on Yahoo! Messenger.)
- By phone (during in-person office hours; please do not leave voice messages): (973) 596-5285
- By e-mail (fastest way to get a response): wu@njit.edu
Class Moodle (will be completely set up soon)
We will use Moodle for discussion, assignment submission, and grading.
Visit http://moodle.njit.edu and please follow the tutorials if you are not familiar with Moodle's user interface.
After logging in with your UCID, you should see IS 634 Section 851.
Introduction
Information retrieval (IR) is a fast-changing field concerned with the representation, organization, storage, and retrieval of information items. A broader definition of information items includes text, numbers, multimedia, and more, while a narrower definition includes only text. Because other courses cover other types of retrieval, this course will focus mostly on text retrieval.
The importance of text retrieval is obvious. Most business data is in textual format. However, most text is not as well organized as the numerical data stored in commercial databases. This, along with linguistic complexity, makes text retrieval difficult. To achieve high retrieval effectiveness, techniques such as automatic indexing, query expansion, local context analysis, information extraction, text mining, and many more have been developed. As an information professional, you should know how to use these techniques to organize, store, and retrieve text effectively and efficiently. This course is designed to address both the theory and the practice of IR.
Prerequisite for IS 634: IS 631
Class Conduct and Attendance:
This is a graduate course. As an NJIT graduate student, you must follow the Institute's honor code. Please refer to the student handbook for details. Specifically:
- Academic dishonesty is not allowed and will be reported to the Dean of Student Services.
- All written assignments will be sent to www.turnitin.com, a plagiarism prevention system, for verification.
- Late assignments will be penalized 25% per day. No late submissions will be allowed for the final semester project or other major deliverables.
- You are required to check announcements on the class Moodle at least 3 times a week.
- When posting messages on the class web board, respect others and be considerate.
- If you have to drop the class after you are assigned to a project group, please notify the instructor and your group members immediately.
Textbooks
- Required 1: Information Storage and Retrieval, by Robert R. Korfhage. Publisher: John Wiley & Sons (ISBN: 0471143383)
- Required 2: Trailblazing a Path Towards Knowledge and Transformation, by Dr. HsinChun Chen (a small booklet containing only 80 5"x7" pages).
- Required 3: Knowledge Management Systems: A Text Mining Perspective, by Dr. HsinChun Chen (a small booklet containing only 50 5"x7" pages).
- Optional 1: Document Warehousing and Text Mining: Techniques for Improving Business Operations, Marketing, and Sales, by Dan Sullivan. Publisher: Wiley, 2001 (ISBN: 0471399590)
- Optional 2: Information Retrieval (2nd Ed), by C. J. van Rijsbergen. Publisher: London: Butterworths, 1979. It is available at the author's web site: http://www.dcs.gla.ac.uk/Keith/Preface.html
Supplemental readings are available through the links on the course schedule below.
Assignments and Grading: (subject to change)
Note: All written assignments should be word-processed and posted on the web board. Do not submit hard copies.
The group presentation slides should be posted as PowerPoint attachments.
Remember to put your name on your assignments, but do not list your social security number!
Last but not least, all written assignments should include proper citations!
- Participation 10%
  - Self-introduction and reply to one other student's intro (1%)
  - IR system design competition (6%)
  - Class participation on Moodle (3%)
- Assignment 5%
  - An Online Search Log (5%)
- Two Retrieval Experiments 20%
- Semester Project 35%
  - Proposal (5%)
  - Project and presentation (30%)
- Final Exam 30%
- Total 100%
The assignments and projects are described below.
§ Participation (10%):
- Self-introduction on Moodle (1%): Please introduce yourself in the "Introductions" conference on the class Moodle. Reply to, and get to know, at least one of your classmates.
- IR design competition (6%): Each student works individually to design an IR system using flowcharts. The system should have two main components: automatic indexing and retrieval. For automatic indexing, include the 3 components in Lecture 3, Part 2, slide 3. For retrieval, please use Boolean retrieval where only the AND and OR operators are used. Use Visio or the drawing function in Word to draw the flowcharts.
- Class participation on Moodle (3%): The instructor will post discussion topics on the web board. Please respond to them. You are also welcome to answer other students' questions. Respect others and be considerate when you respond to postings! Good participation is measured by constructive postings, not by the number of postings.
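For illustration only (the design competition asks for flowcharts, not code), the two components named above, automatic indexing and Boolean retrieval with AND and OR, can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: plain whitespace tokenization stands in for the three indexing components from Lecture 3, which are not listed in this syllabus, and the function names and sample documents are hypothetical.

```python
from collections import defaultdict


def build_index(docs):
    """Automatic indexing (simplified): map each lowercase term to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumption: whitespace tokenization only; no stop list or stemming.
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def boolean_and(index, terms):
    """Return the ids of documents that contain every query term."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()


def boolean_or(index, terms):
    """Return the ids of documents that contain at least one query term."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.union(*postings) if postings else set()


# Hypothetical three-document collection.
docs = {
    1: "text mining for business",
    2: "text retrieval systems",
    3: "data mining",
}
index = build_index(docs)
and_hits = boolean_and(index, ["text", "mining"])  # documents with both terms
or_hits = boolean_or(index, ["text", "mining"])    # documents with either term
```

AND narrows the result to documents that share all query terms, while OR widens it to documents that contain any of them; a flowchart for the retrieval component would show the same two set operations.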
§ An Online Search Log (5%):
This assignment appears to be easy and straightforward. However, if you carefully record the whole procedure, you will find many IR problems during the exercise. The instructor will show you how this exercise relates to other IR topics later in the semester.
- Part I: Search for and record any useful information for the task assigned below. Use as many resources as you can, including web search engines, web directories, and the electronic journal databases available on the NJIT library web site. (The instructor might not have access to other resources.)
Task: Please find the necessary information on "How can text mining contribute to competitive advantage?" Note:
1. The sentence describing the task should not be the only query you use to search for information. You should come up with your own queries. Through this experience, you will learn why it is difficult for users to find information.
2. Even if you already know the answer, please pretend you do not and go on with the assignment.
For each search session, be sure to record the following items during the search process:
1. The search engine, directory, or electronic database you used.
2. The search query you entered. (If you entered several queries into one search tool, treat them as different sessions.)
3. The total number of search hits returned.
4. Among the top twenty returned documents, the number of documents/hits actually relevant to your query, and their URLs (or document titles, if an electronic database is used).
5. Among all returned hits, the number of returned documents you browsed through.
Note:
- If you use more than one query for a particular search engine, please record items 2, 3, 4, and 5 for each query.
- If you do not find useful information and decide to use other resources, please repeat the above steps.
- Clearly mark the URL or title of the best site or best paper obtained from your search.
- Part II: Read "What Do People Want from IR" by Croft. Based on your experience as a user of web search engines and/or text databases, write a short paper (2 single-spaced pages) called "What Do I Need from IR?" List at least 3 items. You do not need to pick items from Croft's paper for discussion. For each item, please provide examples of frustrations that you experienced in the first part of this assignment.
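One possible way to keep the Part I log consistent is to record each search session as a small structured record. The sketch below is only an illustration, not a required format: the class name, field names, and sample values are all hypothetical, and the computed fraction is simply the share of the top twenty hits judged relevant.

```python
from dataclasses import dataclass, field


@dataclass
class SearchSession:
    """One search session: one query entered into one search tool."""
    tool: str                    # search engine, directory, or database used
    query: str                   # the query entered
    total_hits: int              # total number of hits returned
    relevant_in_top_20: list = field(default_factory=list)  # URLs or titles judged relevant
    browsed: int = 0             # number of returned documents browsed

    def top_20_relevant_fraction(self):
        """Fraction of the examined top hits (at most 20) that were judged relevant."""
        examined = min(20, self.total_hits)
        return len(self.relevant_in_top_20) / examined if examined else 0.0


# Hypothetical sample entry; none of these values come from a real search.
session = SearchSession(
    tool="example search engine",
    query="text mining competitive advantage",
    total_hits=1500,
    relevant_in_top_20=[
        "http://example.com/a",
        "http://example.com/b",
        "http://example.com/c",
        "http://example.com/d",
        "http://example.com/e",
    ],
    browsed=30,
)
fraction = session.top_20_relevant_fraction()
```

Keeping one record per query makes it easy to compare search tools and queries when writing the Part II paper.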
§ Retrieval Experiments (20%):
Note that each student will work individually and will use a unique query; therefore, no two students' results will be identical. Results of both retrieval experiments should be posted on your own web site. Note: To gain access to the test collections and the IR system, all students should get an AFS account. Please follow the instructions at http://newaccount.njit.edu/.
Please direct all your questions to the Computing Help Desk at 973-596-2900.
Assigned queries for your RE I and II
(The table will be filled out soon after the withdrawal deadline. Please refer to LISA.QUE to find out the exact query.)
| Query | Assigned to |
| Q1 | Aly, F |
| Q2 | Bellamy, W |
| Q3 | Bleik, S |
| Q4 | Carmen, T |
| Q5 | Professor Wu |
| Q6 | Chiu, L |
| Q7 | Field, C |
| Q8 | Gutta, H |
| Q9 | Hussain, M |
| Q10 | Hytmiah, P |
| Q11 | Jha, F |
| Q12 | Kruaysiriwong, D |
| Q13 | Mahipan, C |
| Q14 | Malamug, D |
| Q15 | Nersesian, E |
| Q16 | Paredes Gomero, C |
| Q17 | Sultana, B |
| Q18 | Uplenchwar, V |
| Q19 | Watrous-deVersterre, L |
| Q20 | Wong, C |
| Q24 | Yang, K |
| Q25 | Yang, M |
| Q26 | Zitzler, G |
| Q27 | |
| Q28 | |
| Q29 | |
| Q30 | |
| Q31 | |
| Q32 | |
| Q33 | |
| Q34 | |
| Q35 | |
- RE I (10%): You are required to use an IR system, a test collection, and a query, and then run several retrieval experiments using Arrow from the Bow toolkit (see the Resources section of the syllabus below). Instructions are in the Lecture 4-2 slides. To present the results of your Retrieval Experiment 1, please use this template: http://web.njit.edu/~wu/CIS634/CIS634re1.html
- RE II (10%): You are required to create a small document collection and a document-term matrix based on the results from your RE I, and perform document classification/clustering analysis using Rainbow and Nenet (see Resources below). Instructions are in the Lecture 6-2 slides. To present the results of your Retrieval Experiment 2, please use this template: http://web.njit.edu/~wu/CIS634/CIS634re2.html
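As a rough illustration of what the document-term matrix in RE II contains, the sketch below builds one from raw term counts in Python. It is an assumption-laden example, not the procedure from the Lecture 6-2 slides: it does not use Rainbow or Nenet or reproduce their file formats, and the function name and sample documents are hypothetical.

```python
from collections import Counter


def document_term_matrix(docs):
    """Return (vocabulary, matrix), where matrix[i][j] counts vocabulary[j] in docs[i]."""
    # Assumption: lowercase whitespace tokens; no stop list or stemming.
    tokenized = [doc.lower().split() for doc in docs]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # A Counter returns 0 for terms that do not occur in this document.
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


# Hypothetical two-document collection.
docs = ["text mining text", "mining data"]
vocabulary, matrix = document_term_matrix(docs)
```

Each row describes one document and each column one term, so documents with similar rows are candidates for the same class or cluster.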
§ Semester Group Project (35%):
Two options: case analysis project or programming project. Instructions here.
Option 1: Case Analysis
You are required to make up a fictional client and its business problems.
- Part I (5%): A proposal defining your client, its business problems, its business environment (competitors and competing products, if any), and at least 2 possible sources of documents for text analysis. Please discuss the sources of documents: Why did you choose them? Do they have enough documents for this project? How are the documents generated (e.g., user reviews posted by real customers, or expert reviews)? Please provide as much detail as possible.
- Part II (25%): Use text mining software programs to collect, pre-process, index, mine, and analyze the documents you collected. Deliverables:
1. All documents you collected;
2. PowerPoint slides for the presentation;
3. A final report containing: a. a 1-page executive summary; b. the main report, covering the tools used and screen shots of outputs; c. an analysis of results; and d. recommended solutions to the client's business problems.
- Part III: Presentation (5%)
- How to prepare deliverables? Please click here.
Good case analysis projects by former students
Option 2: Programming Project
You can design your own project with the instructor's approval. Sample choices are: text retrieval systems, information extraction systems, automatic summarization systems, etc.
- Part I (5%): A proposal describing your project, including the system's functions and the tools (programming languages, databases, etc.) that you will use to develop the system. Most importantly, please use flowcharts to demonstrate how your system works. Please provide as much detail as possible.
- Part II (30%):
- A. System development using the programming language that your group is most familiar with. (You will get all the necessary IR concepts and theories from the lectures. The instructor will not spend time discussing implementation. For example, the instructor will discuss what an inverted file is and how it can be used for automatic indexing and retrieval, but not how to use Java or C++ to implement a program that generates it.)
- B. System design and documentation (including flowcharts, a user manual presented with screen shots, and an evaluation of system performance).
- Part III: Presentation (5%)
- How to prepare deliverables? Please click here.
§ Final Exam (30%):
The final exam covers all previous lectures and the two booklets written by Dr. HsinChun Chen.
Resources
1. from the Information Retrieval Group at the University of Glasgow:
2. from McCallum, Andrew Kachites, Computer Science Dept, Carnegie Mellon University
3. from Neural Networks Research Centre, HELSINKI UNIVERSITY OF TECHNOLOGY
4. from AI Lab, University of Arizona
§ Schedule (subject to change) §
| Week | Topic | Readings |
| | Course Logistics and Overview; IR Academic Resources | Slides: 1-1 |
| | IR Overview | "What do people want from IR" by Croft; Slides: 1-2 |
| | Document and Query Forms | Korfhage Ch 1-2; Van Rijsbergen Ch 1-2; Slides: 2-1, 2-2 |
| | Data Compression; Query Structures | Korfhage Ch 2-3; Slides: 2-2 (cont), 3-1 |
| | Matching Process; Text Analysis | Korfhage Ch 3-5; Van Rijsbergen Ch 2 (the Zipf's Law part only); Slides: 3-2 |
| | TF.IDF; Basics of UNIX; Experiencing an IR System; Retrieval Experiment I Instructions | Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering; Slides: 3-2 (cont), 4-1, 4-2 |
| | Overview of Final Project; Document Similarity; Document and Concept Classification/Clustering Techniques; Text Mining using Neural Networks | Van Rijsbergen Ch 3; Slides: 5-1, 5-2 (quickly skim through), 6-1 |
| | SOM and Nenet package overview (Retrieval Experiment II Instructions); Document Warehousing | AI Lab: A Scalable Self-Organizing Map Algorithm for Textual Classification: A Neural Network Approach to Automatic Thesaurus Generation (Roussinov & Chen, 1998); Sullivan Ch 1, Ch 4; Slides: 6-2, 7-1 |
| | Information Extraction; Text Mining Applications | 1. Sullivan Ch 13; 2. Information Extraction: Techniques and Challenges, by Ralph Grishman; 3. An interactive system for finding complementary literatures: a stimulus to scientific discovery, by Don R. Swanson and Neil R. Smalheiser, Artificial Intelligence, Volume 91, Issue 2, April 1997, Pages 183-203; 4. Combining Data and Text Mining Techniques For Analyzing Financial Reports; Slides: 8-1, 8-2 |
| | Retrieval Effectiveness Measures; Output Presentation | Korfhage Ch 8, 11; Van Rijsbergen Ch 7; Slides: 9-1, 9-2 |
| | IR Effectiveness Improvement Techniques: Relevance Feedback, Query Expansion, Local Context Analysis, and Word-Sense Disambiguation | Korfhage Ch 9; Slides: 10-1 |
| | User Profiles; Alternative Retrieval Techniques | Korfhage Ch 6, 10; S. Chakrabarti, B. Dom and P. Indyk. Enhanced hypertext categorization using hyperlinks. Proceedings of ACM SIGMOD 1998; J. Kleinberg. Authoritative sources in a hyperlinked environment; Slides: 11-1, 11-2 |
| | Natural Language Processing | Little Words Can Make a Big Difference for Text Classification, by Ellen Riloff; Slides: 12 |
| | Final Project Week | Final Project due Dec/06. NO late projects will be accepted. |
| | Final Exam | Covers all previous lectures and the 2 booklets by Dr. HsinChun Chen. Take-home exam, distributed on Moodle. Dec/10, 7 pm - Dec/11, 2 am. (The actual exam will not take you that long if you are well prepared. This 7-hour duration gives flexibility to students who have kids or who can't start on time.) NO late exams will be accepted. |