IS 634 Fall 2009
  • Instructor
  • Contact the Instructor
  • Class Moodle
  • Introduction
  • Class Conduct and Attendance
  • Textbooks
  • Assignments, Grading, and Due Dates
    § Participation (10%)
    § Online Search Logs (5%)
    § Retrieval Experiments (20%)
    § Semester Group Project (35%)
    § Final Exam (30%)

  • Resources
  • Schedule

  • Instructor
     

    Instructor and Coordinator of this course: Professor Yi-Fang Wu, Ph.D. (a.k.a Brook Wu)

    Contact the Instructor:
     

    Dr. Wu:

    • In person: GITC 4104, Hours: Monday and Wednesday, 2:30 pm to 4 pm
    • Online: Yahoo! Messenger ID: profbrookwu, Hours: Monday and Wednesday, 2:30 pm to 4 pm. (Do NOT send offline messages. Please use e-mail when I am not on Yahoo! Messenger.)
    • By phone (during in-person office hours only; please do not leave voice messages): (973) 596-5285
    • By e-mail (fastest way to get a response): wu@njit.edu

     

    Class Moodle (will be completely set up soon)
     

    We will be using Moodle for discussion, assignment submission, and grading.

    Visit http://moodle.njit.edu and follow the tutorials if you are not familiar with Moodle's user interface.

    After logging in with your UCID, you should see IS 634 Section 851.

    Introduction
     

    Information retrieval (IR) is a fast-changing field concerning the representation, organization, storage, and retrieval of information items. A broader definition of information items includes text, numbers, multimedia, and more, while a narrower definition includes only text. Because other courses cover other types of retrieval, this course will focus mostly on text retrieval.

    The importance of text retrieval is clear: most business data is in textual format. However, most text is not as well organized as the numerical data stored in commercial databases. This, along with linguistic complexity, makes text retrieval difficult. To achieve high retrieval effectiveness, techniques such as automatic indexing, query expansion, local context analysis, information extraction, text mining, and many more have been developed. As an information professional, you should know how to use these techniques to organize, store, and retrieve text effectively and efficiently. This course is designed to address both the theory and the practice of IR.

    Pre-requisite for IS 634: IS 631

    Class Conduct and Attendance:
     

    This is a graduate course. As an NJIT graduate student, you must follow the Institute's honor code. Please refer to the student handbook for details. Specifically:

    • Academic dishonesty is not allowed and will be reported to the Dean of Student Services.
    • All written assignments will be sent to www.turnitin.com, a plagiarism prevention system, for verification.
    • Late assignments will be penalized 25% per day. No late submissions will be allowed for the final semester project or other major deliverables.
    • You are required to check announcements on the class Moodle at least 3 times a week.
    • When posting messages on the class web board, respect others and be considerate.
    • If you have to drop the class after you are assigned to a project group, please notify the instructor and your group members immediately.
    Textbooks
     

    Supplemental readings are available through the links on the course schedule below.

    Assignments and Grading: (subject to change)
     

    Note: All written assignments should be word-processed and posted on the web board. Do not submit hard copies. The group presentation slides should be posted as PowerPoint attachments. Remember to put your name on your assignments, but do not list your social security number! Last but not least, all written assignments should include proper citations.

    1. Participation 10%

      • Self-introduction and reply to one other student's intro (1%)
      • IR system design competition (6%)
      • Class participation on Moodle (3%)

    2. Assignment 5%
      • An Online Search Log (5%)

    3. Two Retrieval Experiments 20%

      • RE I (10%)
      • RE II (10%)

    4. Semester Project 35%

      • Proposal (5%)
      • Project and presentation (30%)

    5. Final Exam 30%

      • Final Exam (30%)

    Total 100%
    The following sections describe the assignments and projects.
     
    § Participation (10%):
     
    • Self-introduction on Moodle (1%): Please introduce yourself in the "Introductions" conference on the class Moodle. Reply to and get to know at least one of your classmates.

    • IR design competition (6%): Each student works individually to design an IR system using flowcharts. The system should have two main components: automatic indexing and retrieval. For automatic indexing, include the 3 components in Lecture 3, Part 2, slide 3. For retrieval, use Boolean retrieval with only the AND and OR operators. Use Visio or the drawing function in Word to draw the flowcharts.

    • Class participation on Moodle (3%): The instructor will post discussion topics on the web board. Please respond to them. You are also welcome to answer other students' questions. Respect others and be considerate when you respond to postings! Good participation is measured by constructive postings, not by the number of postings.
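The two components named for the IR design competition, automatic indexing and Boolean retrieval with AND and OR, can be sketched in a few lines of Python. This is only an illustrative sketch: the stop list, the sample documents, and the function names are assumptions made up for this example, and the indexing steps shown (tokenizing and stop-word removal) are not necessarily the three components from the lecture slides.

```python
import re
from collections import defaultdict

# Hypothetical stop list and sample documents, chosen only for illustration.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "for"}

DOCS = {
    1: "Text mining for competitive advantage",
    2: "Automatic indexing of text documents",
    3: "Mining data in commercial databases",
}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def build_index(docs):
    """Automatic indexing: map each term to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def boolean_and(index, terms):
    """Return the documents that contain every term (set intersection)."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()


def boolean_or(index, terms):
    """Return the documents that contain at least one term (set union)."""
    return set().union(*(index.get(t, set()) for t in terms))


index = build_index(DOCS)
print(boolean_and(index, ["text", "mining"]))  # documents containing both terms
print(boolean_or(index, ["text", "mining"]))   # documents containing either term
```

The term-to-documents mapping is an inverted file: AND becomes a set intersection over the postings and OR becomes a set union, which is why the two operators are cheap to evaluate once indexing is done.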

     
    § An Online Search Log (5%):
     

    This assignment appears to be easy and straightforward. However, if you carefully record the whole procedure, you will run into many IR problems during the exercise. The instructor will show you how this exercise relates to other IR topics later in the semester.

    • Part I: Search for and record any useful information for the assigned task (see below). Use as many resources as you can, including web search engines, web directories, and the electronic journal databases available on the NJIT library web site. (The instructor might not have access to other resources.)
    • Task: Please find the necessary information on "How can text mining contribute to competitive advantage?" (Note: 1. The sentence describing the task should not be the only query you use to search for information. You should come up with your own queries. Through this experience, you will learn why it is difficult for users to find information. 2. Even if you already know the answer, please pretend you do not and go on with the assignment.)

      For each search session, be sure to record the following items during the search process:

      1. The search engine/directory/electronic database you used
      2. The search query you entered. (If you entered several queries to one search tool, treat them as different sessions.)
      3. Total number of search hits returned.
      4. In the top twenty returned documents, find out the number of documents/hits actually relevant to your query and their URLs (or document titles, if an electronic database is used).
      5. The number of returned documents you browsed through, out of all returned hits.

      Note:

      • If you use more than one query with a particular search engine, please record items 2, 3, 4, and 5 for each query.
      • If you do not find useful information and decide to use other resources, please repeat the above steps.
      • Clearly mark the URL or title of the best site/best paper obtained from your search.

    • Part II: Read "What Do People Want from IR" by Croft. Based on your experience as a user of web search engines and/or text databases, write a short paper (2 single-spaced pages) called "What Do I Need from IR?" List at least 3 items. You do not need to pick items from Croft's paper. For each item, please provide examples of frustrations that you experienced in Part I of this assignment.
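The items recorded for each search session in Part I are enough to compute a simple effectiveness figure: the share of the top twenty results that you judged relevant. The sketch below is not part of the assignment requirements. The record layout, the field names, and the sample values are hypothetical, and dividing by the number of hits when fewer than twenty were returned is one possible convention among several.

```python
from dataclasses import dataclass


@dataclass
class SearchSession:
    """One logged search session (hypothetical field names)."""
    tool: str                # item 1: search engine, directory, or database
    query: str               # item 2: the query entered
    total_hits: int          # item 3: total number of hits returned
    relevant_in_top_20: int  # item 4: relevant documents among the top twenty
    browsed: int             # item 5: number of returned documents browsed


def precision_in_top_20(session):
    """Relevant documents in the top twenty divided by the documents inspected.

    If fewer than twenty hits were returned, divide by the number of hits.
    """
    inspected = min(20, session.total_hits)
    return session.relevant_in_top_20 / inspected if inspected else 0.0


# Made-up example values, for illustration only.
session = SearchSession(
    tool="ExampleSearch",
    query="text mining competitive advantage",
    total_hits=1200,
    relevant_in_top_20=7,
    browsed=25,
)
print(f"{precision_in_top_20(session):.2f}")
```

Comparing this figure across sessions gives a rough way to see which queries and which search tools worked best for the task.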

     
    § Retrieval Experiments (20%):
     

    Note that each student will work individually and will use a unique query; therefore, no two students' results will be identical. Results of both retrieval experiments should be posted on your own web site. Note: To gain access to the test collections and the IR system, all students should get an AFS account. Please follow the instructions at http://newaccount.njit.edu/. Please direct all your questions to the Computing Help Desk at 973-596-2900.

    Assigned queries for your RE I and II

    (The table will be filled out soon after the withdrawal deadline. Please refer to LISA.QUE to find out the exact query.)

    Query   Student
    Q1      Aly, F
    Q2      Bellamy, W
    Q3      Bleik, S
    Q4      Carmen, T
    Q5      Professor Wu
    Q6      Chiu, L
    Q7      Field, C
    Q8      Gutta, H
    Q9      Hussain, M
    Q10     Hytmiah, P
    Q11     Jha, F
    Q12     Kruaysiriwong, D
    Q13     Mahipan, C
    Q14     Malamug, D
    Q15     Nersesian, E
    Q16     Paredes Gomero, C
    Q17     Sultana, B
    Q18     Uplenchwar, V
    Q19     Watrous-deVersterre, L
    Q20     Wong, C
    Q24     Yang, K
    Q25     Yang, M
    Q26     Zitzler, G
    Q27 to Q35   (blank)

    • RE I (10%): You are required to use an IR system, a test collection, and a query, and then run several retrieval experiments using Arrow, part of the Bow toolkit (see the Resources section of the syllabus below). Instructions are in the Lecture 4-2 slides. To present the results of your Retrieval Experiment 1, please use this template: http://web.njit.edu/~wu/CIS634/CIS634re1.html

    • RE II (10%): You are required to create a small document collection and a document-term matrix based on the results from your RE I, and perform document classification/clustering analysis using Rainbow and Nenet (see Resources below). Instructions are in the Lecture 6-2 slides. To present the results of your Retrieval Experiment 2, please use this template: http://web.njit.edu/~wu/CIS634/CIS634re2.html
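As a rough illustration of what a document-term matrix is, the sketch below builds one from raw term counts in plain Python. It does not reproduce the input formats that Rainbow or Nenet expect (follow the lecture instructions for those), and the sample documents and function names are assumptions invented for this example.

```python
import re


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def document_term_matrix(docs):
    """Return (vocabulary, matrix); matrix[i][j] counts term j in document i."""
    vocab = sorted({term for text in docs for term in tokenize(text)})
    column = {term: j for j, term in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in docs]
    for i, text in enumerate(docs):
        for term in tokenize(text):
            matrix[i][column[term]] += 1
    return vocab, matrix


# Made-up sample documents, for illustration only.
docs = [
    "text retrieval and text mining",
    "neural networks for text clustering",
]
vocab, matrix = document_term_matrix(docs)
print(vocab)
for row in matrix:
    print(row)
```

Each row is one document and each column is one term, so documents with similar rows use similar vocabulary. That is the property classification and clustering methods rely on.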

     
    § Semester Group Project (35%):
     

    Two options: case analysis project or programming project. Instructions here.

    • Option 1: Case analysis:

    • You are required to make up a fictional client and its business problems.

      • Part I (5%): A proposal defining your client, its business problems, its business environment (competitors and competing products, if any), and at least 2 possible sources of documents for text analysis. Please discuss the sources of documents: Why did you choose them? Do they have enough documents for this project? How are the documents generated (e.g., user reviews posted by real customers, or expert reviews)? Please provide as much detail as possible.
      • Part II (25%): Use text mining software programs to collect, pre-process, index, mine, and analyze the documents you collected. Deliverables: 1. All documents you collected; 2. PowerPoint slides for the presentation; 3. A final report containing a. a 1-page executive summary, b. the main report: tools used and screenshots of outputs, c. analysis of results, and d. recommended solutions to the client's business problems.
      • Part III: Presentation (5%)
      • How to prepare deliverables? Please click here.

      Good case analysis projects by former students


    • Option 2: Programming project:

    • You can design your own project with the instructor's approval. Sample choices are: text retrieval systems, information extraction systems, automatic summarization systems, etc.

      • Part I (5%): A proposal describing your project, including the system functions and the tools (programming languages, databases, etc.) that you will use to develop the system. Most importantly, please use flowcharts to demonstrate how your system works. Please provide as much detail as possible.
      • Part II (25%):
      • A. System development using whichever programming language your group is most familiar with. (You will have all the necessary IR concepts and theories from lectures. The instructor will not spend time discussing implementation. For example, the instructor will discuss what an inverted file is and how it can be used for automatic indexing and retrieval, but not how to use Java or C++ to implement a program that generates it.)
      • B. System design and documentation (including flow charts, user manual presented with screen shots, and the evaluation of system performance).
      • Part III: Presentation (5%)
      • How to prepare deliverables? Please click here.

     

     
    § Final Exam (30%):
      It covers all previous lectures and two booklets written by Dr. HsinChun Chen.
    Resources
    1. from The Information Retrieval Group at the University of Glasgow

    2. from McCallum, Andrew Kachites, Computer Science Dept, Carnegie Mellon University

    3. from Neural Networks Research Centre, Helsinki University of Technology

    4. from AI Lab, University of Arizona


    § Schedule (subject to change) §
    Week (date), followed by topics, readings, and slides

    Week 1 (Aug/31)
    • Course Logistics and Overview
    • IR Academic Resources
    • Slides: 1-1

    Week 2 (Sept/07)
    • IR overview
    • "What Do People Want from IR" by Croft
    • Slides: 1-2

    Week 3 (Sept/14)
    • Document and Query Forms
    • Korfhage Ch1-2
    • Van Rijsbergen Ch1-2
    • Slides: 2-1, 2-2

    Week 4 (Sept/21)
    • Data Compression
    • Query Structures
    • Korfhage Ch2-3
    • Slides: 2-2 (cont), 3-1

    Week 5 (Sept/28)
    • Matching Process
    • Text Analysis
    • Korfhage Ch 3-5
    • Rijsbergen Ch2 (Zipf's Law part only)
    • Slides: 3-2

    Week 6 (Oct/05)
    • TF.IDF
    • Basics of UNIX
    • Experiencing an IR System
    • Retrieval Experiment I Instructions
    • Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering
    • Slides: 3-2 (cont), 4-1, 4-2

    Week 7 (Oct/12)
    • Overview of final project
    • Document Similarity
    • Document and Concept Classification/Clustering Techniques
    • Text Mining using Neural Networks
    • Rijsbergen Ch3
    • Slides: 5-1, 5-2 (quickly skim through), 6-1

    Week 8 (Oct/19)
    • SOM and Nenet package overview (Retrieval Experiment II Instructions)
    • Document Warehousing
    • AI LAB: A Scalable Self-Organizing Map Algorithm for Textual Classification: A Neural Network Approach to Automatic Thesaurus Generation (Roussinov & Chen, 1998)
    • Sullivan Ch1, Ch4
    • Slides: 6-2, 7-1

    Week 9 (Oct/26)
    • Information Extraction
    • Text Mining Applications
    1. Sullivan Ch13
    2. Information Extraction: Techniques and Challenges, by Ralph Grishman
    3. An interactive system for finding complementary literatures: a stimulus to scientific discovery, Artificial Intelligence, Volume 91, Issue 2, April 1997, Pages 183-203, Don R. Swanson and Neil R. Smalheiser
    4. Combining Data and Text Mining Techniques For Analyzing Financial Reports
    • Slides: 8-1, 8-2

    Week 10 (Nov/02)
    • Retrieval Effectiveness Measures
    • Output Presentation
    • Korfhage Ch 8, 11
    • Rijsbergen Ch 7
    • Slides: 9-1, 9-2

    Week 11 (Nov/09)
    • IR Effectiveness Improvement Techniques: Relevance Feedback, Query Expansion, Local Context Analysis, and Word-Sense Disambiguation
    • Korfhage Ch 9
    • Slides: 10-1

    Week 12 (Nov/16)
    • User Profiles
    • Alternative Retrieval Techniques
    • Korfhage Ch 6, 10
    • S. Chakrabarti, B. Dom and P. Indyk. Enhanced hypertext categorization using hyperlinks. Proceedings of ACM SIGMOD 1998.
    • J. Kleinberg. Authoritative sources in a hyperlinked environment.
    • Slides: 11-1, 11-2

    Week 13 (Nov/23)
    • Natural Language Processing
    • Little Words Can Make a Big Difference for Text Classification by Ellen Riloff
    • Slides: 12

    Week 14 (Nov/30)
    • Final Project Week
    • Final Project Due Dec/06
    • NO late projects will be accepted.

    Week 15
    • Final Exam, covering all previous lectures and the 2 booklets by Dr. HsinChun Chen.
    • Take-home exam; it will be distributed on Moodle.
    • Dec/10, 5pm - Dec/11, 3 am
    • (The actual exam will not take you that long, if you are well-prepared. This 7-hour duration gives flexibility for students with kids or who can't start on time.)
    • NO late exams will be accepted.
