IS 634 and IS 392 - Spring 2008

Note:

  • The classroom has changed: the class now meets in Cullimore, Lecture Hall 3, on Thursdays, 6-7:30 pm.
  • The course has changed back to hybrid mode. Please purchase the lecture CDs from the NJIT bookstore or borrow them from the NJIT library. Follow the syllabus to find the specific class materials for each week, and listen to the CDs before you come to class: in class we will only go over the most important material and conduct class activities.
  • Please check the official syllabus posted on Prof. Wu's web site regularly for the most up-to-date information.
  • Instructor
  • Contact the Instructor
  • Class WebCT
  • Introduction
  • Class Conduct and Attendance
  • Textbooks
  • Assignments, Grading, and Due Dates
    § Participation (10%)
    § Online Search Logs (5%)
    § Retrieval Experiments (20%)
    § Semester Group Project (35%)
    § Final Exam (30%)

  • Resources
  • Schedule

  • Instructor
     

    Instructor and coordinator of this course: Professor Yi-Fang Wu, Ph.D. (a.k.a. Brook Wu)

    Contact the Instructor:
     

    Dr. Wu:

    • In person: GITC 4104. Hours: Monday 4:15-6:30 pm, Thursday 3:30-5:30 pm (right before class).
    • Online: Yahoo! Messenger ID: profbrookwu. Hours: same as in-person office hours. (Do NOT send offline messages. Please use e-mail when I am not on Yahoo! Messenger.)
    • By e-mail (fastest way to get a response): wu@njit.edu
    • By phone (during in-person office hours only; please do not leave voice messages): (973) 596-5285

     

    Class WebCT
      Go to http://webct.njit.edu and log in with your UCID. You should see IS 634 in the list of WebCT courses you have access to. (Our WebCT should be ready by Jan/15.)
    Introduction
     

    Information retrieval (IR) is a fast-changing field concerned with the representation, organization, storage, and retrieval of information items. Under a broad definition, information items include text, numbers, multimedia, and more; under a narrow definition, they include only text. Because other courses cover other types of retrieval, this course focuses mostly on text retrieval.

    The importance of text retrieval is clear: most business data is in textual format. However, most text is not as well organized as the numerical data stored in commercial databases. This lack of structure, along with linguistic complexity, causes many of the problems in text retrieval. To achieve high retrieval effectiveness, techniques such as automatic indexing, query expansion, local context analysis, information extraction, text mining, and many more have been developed. As an information professional, you should know how to use these techniques to organize, store, and retrieve text effectively and efficiently. This course addresses both the theory and the practice of IR.

    Pre-requisite for IS 634: IS 631

    Pre-requisite for IS 392: Math 333

    Class Conduct and Attendance:
     

    This is a graduate course. As an NJIT graduate student, you must follow the Institute's honor code. Please refer to the student handbook for details. Specifically:

    • Academic dishonesty is not allowed and will be reported to the Dean of Student Services.
    • All written assignments will be sent to www.turnitin.com, a plagiarism prevention system, for verification.
    • Late assignments will be penalized 25% per day. No late submissions will be accepted for the final semester project or other major deliverables.
    • You are required to check WebCT announcements at least 3 times a week.
    • When posting messages on the class web board, respect others and be considerate.
    • You are expected to attend all class meetings ON TIME. Class attendance will be recorded every meeting. Poor attendance will negatively affect your semester grade.
    • If you have to drop the class after you are assigned to a project group, please notify the instructor and your group members immediately.
    • Snow closing information will be available on the NJIT web site. The instructor does not make the decision.
    Textbooks
     

    Supplemental readings are available through the links on the course schedule below.

    Assignments and Grading: (subject to change)
     

    Note: All written assignments should be word-processed and posted on the web board. Do not submit hard copies. The group presentation slides should be posted as PowerPoint attachments. Remember to put your name on your assignments, but do not list your social security number! Last but not least, all written assignments should include proper citations.

    1. Participation 10%

      • Self-introduction and reply to one other student's intro (1%)
      • IR system design competition (6%)
      • WebCT and in class participation (3%).

    2. Assignment 5%
      • An Online Search Log (5%)

    3. Two Retrieval Experiments 20%

      • RE I (10%)
      • RE II (10%)

    4. Semester Project 35%

      • Proposal (5%)
      • Project and presentation (30%)

    5. Final Exam 30%

      • Final Exam (30%)

    Total 100%
    The assignments and projects are described below.
     
    § Participation (10%):
     
    • Self-introduction on the web board (1%): Please introduce yourself in the "Introductions" conference on the class WebCT. Reply to, and get to know, at least one of your classmates.

    • IR design competition (6%): Each student works individually to design an IR system using flow charts. The system should have two main components: automatic indexing and retrieval. For automatic indexing, include the 3 components in Lecture 3, Part 2, slide 3. For retrieval, use Boolean retrieval in which only the AND and OR operators are used. Use Visio or the drawing function in Word to draw the flow charts.

    • In-class and/or WebCT participation (3%): The instructor will post discussion topics on the web board. Please respond to them. You are also welcome to answer other students' questions. Respect others and be considerate when you respond to postings! "Good participation" is measured by constructive postings, not by the number of postings.
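
    To make the two components of the IR design competition concrete, the following is a minimal Python sketch of automatic indexing feeding Boolean retrieval with only AND and OR. It is an illustration under stated assumptions, not the design expected for the competition: the stop list, the tokenizer, and the function names are hypothetical, and the indexing steps shown here are not necessarily the 3 components from the lecture slides.

```python
import re
from collections import defaultdict

# Hypothetical stop list, for illustration only.
STOP_WORDS = {"a", "an", "and", "of", "the", "in", "to", "is", "for", "or"}


def index_terms(text):
    """Lowercase, tokenize, and drop stop words (an assumed indexing pipeline)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def build_inverted_index(docs):
    """Map each index term to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in index_terms(text):
            index[term].add(doc_id)
    return index


def boolean_and(index, terms):
    """Documents containing every query term (terms must already be lowercase)."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()


def boolean_or(index, terms):
    """Documents containing at least one query term."""
    return set().union(*(index.get(t, set()) for t in terms))


# Tiny made-up collection, for illustration only.
docs = {
    1: "Text mining for competitive advantage",
    2: "Automatic indexing of text",
    3: "Boolean retrieval and indexing",
}
index = build_inverted_index(docs)

print(boolean_and(index, ["text", "indexing"]))
print(boolean_or(index, ["mining", "retrieval"]))
```

    In this sketch the AND operator is a set intersection over posting sets and the OR operator is a set union, which is why an inverted index makes both operators cheap to evaluate.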

     
    § An Online Search Log (5%):
     

    This assignment appears to be easy and straightforward. However, if you carefully record the whole procedure, you will encounter many IR problems along the way. The instructor will show you how this exercise relates to other IR topics later in the semester.

    • Part I: Search for and record any useful information for the assigned task (see below). Use as many resources as you can, including web search engines, web directories, and the electronic journal databases available on the NJIT library web site. (The instructor might not have access to other resources.)
    • Task: Please find the necessary information on "How can text mining contribute to competitive advantage?" (Notes: 1. The sentence describing the task should not be the only query you use. Come up with your own queries; through this experience, you will learn why it is difficult for users to find information. 2. Even if you already know the answer, please pretend you do not and carry out the assignment.)

      For each search session, be sure to record the following items during the search process:

      1. The search engine/directory/electronic database you used
      2. The search query you entered. (If you entered several queries to one search tool, treat them as different sessions.)
      3. Total number of search hits returned.
      4. In the top twenty returned documents, find out the number of documents/hits actually relevant to your query and their URLs (or document titles, if an electronic database is used).
      5. The number of returned documents, out of all returned hits, that you browsed through.

      Note:

      • If you use more than one query with a particular search engine, record items 2, 3, 4, and 5 for each query.
      • If you do not find useful information and decide to use other resources, please repeat the above steps.
      • Clearly mark the URL or title of the best site/best paper obtained from your search.

    • Part II: Read "What Do People Want from IR" by Croft. Based on your experience as a user of web search engines and/or text databases, write a short paper (2 single-spaced pages) called "What Do I Need from IR?" List at least 3 items. You do not need to pick your items from Croft's paper. For each item, provide examples of the frustrations you experienced in the first part of this assignment.
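
    The counts recorded in Part I can be summarized per session. The sketch below, in Python, computes precision over the top twenty hits from item 4 of the log. It is only an illustration: the session records are made-up values, the field names are hypothetical, and dividing by 20 is an assumption that a search tool returned at least twenty hits.

```python
def precision_at_k(relevant_in_top_k, k=20):
    """Fraction of the top-k returned documents that were judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not 0 <= relevant_in_top_k <= k:
        raise ValueError("relevant count must be between 0 and k")
    return relevant_in_top_k / k


# Made-up session records, for illustration only (not real search results).
sessions = [
    {"tool": "search engine A", "query": "text mining competitive advantage",
     "total_hits": 12000, "relevant_in_top_20": 5, "browsed": 20},
    {"tool": "journal database B", "query": "text mining business strategy",
     "total_hits": 85, "relevant_in_top_20": 11, "browsed": 30},
]

for s in sessions:
    s["p_at_20"] = precision_at_k(s["relevant_in_top_20"])
    print(s["tool"], s["query"], s["p_at_20"])

# The session with the highest precision among its top twenty hits.
best = max(sessions, key=lambda s: s["p_at_20"])
```

    If a tool returns fewer than twenty hits, one reasonable choice is to pass the number actually returned as k; whichever convention you use, state it in your log.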

     
    § Retrieval Experiments (20%):
     

    Each student will work individually and will use a unique query, so no two students' results will be identical. Results of both retrieval experiments should be posted on your own web site. Note: To gain access to the test collections and the IR system, every student needs an AFS account. Please follow the instructions at http://newaccount.njit.edu/ and direct all account questions to the Computing Help Desk at 973-596-2900.

    Assigned queries for your RE I and II

    (The table will be filled out soon after the withdrawal deadline. Please refer to LISA.QUE to find the exact query.)

    Query   Student
    Q1      Bekele, Mahlet
    Q2      Fiagbe, Peter
    Q3      Idris, Yomi
    Q4      Kahlon, Biplavjit
    Q5      Professor Wu
    Q6      Kobylinski, Mirko
    Q7      Krikun, Anastasiya
    Q8      Li, Xugong
    Q9      Monroy, Alain
    Q10     Nagdev, Umesh
    Q11     Neyra, Wilson
    Q12     Pax, Christopher
    Q13     Schuler, Richard
    Q14     Simpson, Melford
    Q15     Srinivasan, Anand
    Q16     Tao, Jyun-Ze
    Q17     Terranova, Joseph
    Q18     Thiaw, Lamine
    Q19     Vankawala, Maulikkumar
    Q20     Wiggins, Kelly
    Q24     Wong, Chi
    Q25
    Q26
    Q27
    Q28
    Q29
    Q30
    Q31
    Q32
    Q33
    Q34
    Q35


    • RE I (10%): You are required to use an IR system, a test collection, and a query to run several retrieval experiments with Arrow from the Bow toolkit (see the Resources section of the syllabus below). Instructions are in the Lecture 4-2 slides. To present the results of your Retrieval Experiment I, please use this template: http://web.njit.edu/~wu/CIS634/CIS634re1.html

    • RE II (10%): You are required to create a small document collection and a document-term matrix based on the results of your RE I, and to perform document classification/clustering analysis using Rainbow and Nenet (see Resources below). Instructions are in the Lecture 6-2 slides. To present the results of your Retrieval Experiment II, please use this template: http://web.njit.edu/~wu/CIS634/CIS634re2.html
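
    For readers unfamiliar with the document-term matrix mentioned in RE II, here is a minimal Python sketch of what such a matrix contains: one row per document, one column per vocabulary term, and raw term counts in the cells. The tiny collection and the whitespace tokenizer are assumptions for illustration; this is not the input format of Rainbow or Nenet, which is described in the lecture slides.

```python
from collections import Counter


def document_term_matrix(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in doc i."""
    # Assumed tokenizer: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


# Made-up three-document collection, for illustration only.
docs = [
    "library catalog search",
    "catalog indexing",
    "search search engine",
]
vocabulary, matrix = document_term_matrix(docs)

print(vocabulary)
for row in matrix:
    print(row)
```

    Each row is the term-count vector of one document; clustering and classification tools compare documents by comparing these vectors, usually after weighting the raw counts.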

     
    § Semester Group Project (35%):
     

    Two options: case analysis project or programming project. Instructions here.

    • Option 1: Case analysis:

    • You are required to make up a fictional client and its business problems.

      • Part I (5%): A proposal defining your client, its business problems, its business environment (competitors and competing products, if any), and at least 2 possible sources of documents for text analysis. Please discuss the sources of documents: Why did you choose them? Do they have enough documents for this project? How are the documents generated (e.g., user reviews posted by real customers, expert reviews, etc.)? Please provide as much detail as possible.
      • Part II (25%): Use text mining software to collect, pre-process, index, mine, and analyze the documents you collected. Deliverables: 1. All documents you collected; 2. PowerPoint slides for the presentation; 3. A final report containing a. a 1-page executive summary, b. the main report: tools used and screen shots of outputs, c. an analysis of results, and d. recommended solutions to the client's business problems.
      • Part III: Presentation (5%)
      • How to prepare deliverables? Please click here.

      Good case analysis projects by former students


      Option 2: Programming Project

    • You can design your own project with the instructor's approval. Sample choices are: text retrieval systems, information extraction systems, automatic summarization systems, etc.

      • Part I (5%): A proposal describing your project, including the system functions and the tools (programming languages, databases, etc.) that you will use to develop the system. Most importantly, use flow charts to show how your system works. Please provide as much detail as possible.
      • Part II (30%):
      • A. System development using whichever programming language your group is most familiar with. (You will get all the necessary IR concepts and theories from the lectures. The instructor will not spend time discussing implementation. For example, the instructor will discuss what an inverted file is and how it can be used for automatic indexing and retrieval, but not how to write a Java or C++ program to generate one.)
      • B. System design and documentation (including flow charts, user manual presented with screen shots, and the evaluation of system performance).
      • Part III: Presentation (5%)
      • How to prepare deliverables? Please click here.

     

     
    § Final Exam (30%):
      It covers all previous lectures and two booklets written by Dr. HsinChun Chen.
    Resources
    1. from the Information Retrieval Group at the University of Glasgow

    2. from McCallum, Andrew Kachites, Computer Science Dept., Carnegie Mellon University

    3. from the Neural Networks Research Centre, Helsinki University of Technology

    4. from the AI Lab, University of Arizona


    § Schedule (subject to change, last updated January/07/2008) §
    Week 1 (01/24)
      Topic: Course Logistic and Overview
      Readings: IR Academic Resources
      Slides: 1-1

    Week 2 (01/31)
      Topic: IR overview
      Readings: "What do people want from IR" by Croft
      Slides: 1-2
      Due: Self-introduction on WebCT due Feb/03.

    Week 3 (02/07)
      Topic: Document and Query Forms
      Readings: Korfhage Ch1-2; Van Rijsbergen Ch1-2
      Slides: 2-1, 2-2
      Due: On-line search log due Feb10.

    Week 4 (02/14)
      Topic: Data Compression; Query Structures
      Readings: Korfhage Ch2 - 3
      Slides: 2-2 (cont), 3-1

    Week 5 (02/21)
      Topic: Matching Process; Text Analysis
      Readings: Korfhage Ch 3-5; Rijsbergen Ch2 (The Zipf's Law Part only)
      Slides: 3-2
      Due: IR competition in-class

    Week 6 (02/28)
      Topic: TF.IDF; Basics of UNIX; Experiencing an IR System; Retrieval Experiment I Instructions
      Readings: Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering
      Slides: 3-2 (cont), 4-1, 4-2
      Due: Finish reading Dr. Chen's book "Trailblazing a Path Towards Knowledge and Transformation"

    Week 7 (03/06)
      Topic: Overview of Final Projects; Document Similarity; Document and Concept Classification/Clustering Techniques; Text Mining using Neural Networks
      Readings: Rijsbergen Ch3
      Slides: 5-1, 5-2 (quickly skim through), 6-1
      Due: Retrieval Exp I due 03/09

    Week 8 (03/13)
      Topic: SOM and Nenet package overview (Retrieval Experiment II Instructions); Document Warehousing
      Readings: AI LAB: A Scalable Self-Organizing Map Algorithm for Textual Classification: A Neural Network Approach to Automatic Thesaurus Generation (Roussinov & Chen, 1998); Sullivan Ch1
      Slides: 6-2, 7-1
      Due: Semester Project Proposal due 03/16

    Spring Break 03/17-23

    Week 9 (03/27)
      Topic: Information Extraction; Text Mining Applications
      Readings:
        1. Sullivan Ch13
        2. Information Extraction: Techniques and Challenges, by Ralph Grishman
        3. An interactive system for finding complementary literatures: a stimulus to scientific discovery, Artificial Intelligence, Volume 91, Issue 2, April 1997, Pages 183-203, Don R. Swanson and Neil R. Smalheiser
        4. Combining Data and Text Mining Techniques For Analyzing Financial Reports
      Slides: 8-1, 8-2

    Week 10 (04/03)
      Topic: Retrieval Effectiveness Measures; Output Presentation
      Readings: Korfhage Ch 8, 11; Rijsbergen Ch 7
      Slides: 9-1, 9-2
      Due: Retrieval Exp II due 04/06

    Week 11 (04/10)
      Topic: IR Effectiveness Improvement Techniques: Relevance Feedback, Query Expansion, Local Context Analysis, and Word-Sense Disambiguation
      Readings: Korfhage Ch 9
      Slides: 10-1
      Due: Finish reading Dr. Chen's book "Knowledge Management Systems: A Text Mining Perspective"

    Week 12 (04/17)
      Topic: User Profiles; Alternative Retrieval Techniques
      Readings:
        Korfhage Ch 6, 10
        S. Chakrabarti, B. Dom and P. Indyk. Enhanced hypertext categorization using hyperlinks. Proceedings of ACM SIGMOD 1998.
        J. Kleinberg. Authoritative sources in a hyperlinked environment.
      Slides: 11-1, 11-2

    Week 13 (04/24)
      Topic: Natural Language Processing
      Readings: Little Words Can Make a Big Difference for Text Classification by Ellen Riloff
      Slides: 12

    Week 14 (05/01)
      Topic: Final Project Week
      Due: Final Project Due 04/30

    Week 15
      Topic: Final Exam, covering all previous lectures and the 2 booklets by Dr. HsinChun Chen.
      TBA