Of The Day: Information Retrieval

BIRLA INSTITUTE OF TECHNOLOGY & SCIENCE, PILANI

WORK INTEGRATED LEARNING PROGRAMMES

Digital Learning

Part A: Course Design

Course Title	Information Retrieval
Course No(s)	SS ZG537
Credit Units	4
Credit Model
Content Authors	Dr.V.Maheswari

Course Objectives

No	Course Objectives
CO1	To understand structure and organization of various components of an IR system
CO2	To understand information representation models, term scoring mechanisms, etc. in the complete search system
CO3	To understand architecture of search engines, crawlers and the web search
CO4	To understand cross lingual retrieval and multimedia information retrieval
CO5	To understand the concepts of Recommender Systems.

Text Book(s)

T1	C. D. Manning, P. Raghavan and H. Schutze. Introduction to Information Retrieval, Cambridge University Press, 2008. http://nlp.stanford.edu/IR-book/http://nlp.stanford.edu/IR-book/

Reference Book(s) & other resources

R1	Modern Information Retrieval, Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Addison-Wesley, 2000. http://people.ischool.berkeley.edu/~hearst/irbook/http://people.ischool.berkeley.edu/~hearst/irbook/
R2	Ricci, F.; Rokach, L.; Shapira, B.; Kantor, P.B. (Eds.), Recommender Systems Handbook. 1st Edition., 2011, 845 p. 20 illus., Hardcover, ISBN: 978-0-387-85819-7
R3	Cross-Language Information Retrieval by By Jian-Yun Nie Morgan & Claypool Publisher series 2010
R4	Multimedia Information Retrieval by Stefan M. Rüger Morgan & Claypool Publisher series 2010.
R5	Information Retrieval: Implementing and Evaluating Search Engines by S. Buttcher, C. Clarke and G. Cormack, MIT Press, 2010.
R6	Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data by B. Liu, Springer, Second Edition, 2011.
R7	Search Engines: Information Retrieval in Practice by Bruce Croft, Donald Metzler, and Trevor Strohman, Addison-Wesley, 2009.
R8	Koehn P., “Statistical Machine Translation”, Cambridge University Press, 2010.

Learning Outcomes:

No	Learning Outcomes
LO1	Students will gain understanding about an information retrieval system as a whole and about its components.
LO2	Students will have knowledge about the design issues and their solutions of different type of models including Boolean, vector space etc.
LO3	Students will have detailed understanding about text indexing, mining, weighting schemes etc.
LO4	Students will acquire knowledge about cross lingual and multimedia information retrieval.
LO5	Differentiate between different recommender systems and suggest a suitable system based on the problem and data available.
LO6	With the acquired knowledge students will be able to design and build different kind of information retrieval systems.

Content Structure

Modular Content Structure

No	Title of the Module
M1	Introduction, Boolean Retrieval
M2	Boolean Retrieval, Term vocabulary and postings list
M3	Postings list, Phrase queries, Dictionaries
M4	Tolerant Retrieval, Index Construction
M5	Index Construction, Vector Space Model
M6	Vector Space Model, Text Classification
M7	Vector space classification, Evaluation of classification
M8	Text Clustering
M9	Web Search, Web Crawler, Link Analysis
M10	Multimedia IR, Cross-Lingual IR

Glossary of Terms:

1. Contact Hour (CH) stands for a hour long live session with students conducted either in a physical classroom or enabled through technology. In this model of instruction, instructor led sessions will be for 20 CH.

a. Pre CH = Self Learning done prior to a given contact hour

b. During CH = Content to be discussed during the contact hour by the course instructor

c. Post CH = Self Learning done post the contact hour

2. RL stands for Recorded Lecture or Recorded Lesson. It is presented to the student through an online portal. A given RL unfolds as a sequences of video segments interleaved with exercises

3. SS stands for Self-Study to be done as a study of relevant sections from textbooks and reference books. It could also include study of external resources.

4. LE stands for Lab Exercises

5. HW stands for Home Work will consists could be a selection of problems from the text.

Part B: Contact Session Plan

Academic Term	Second Semester 2017-2018
Course Title	Information Retrieval
Course No	SS ZG537
Content Developer	Dr. Maheswari .V

M1: Introduction, Boolean Retrieval

Type	Description
RL 1.1	RL 1.1.1 - Information vs. Data Retrieval RL 1.1.2 – IR task
RL 1.2	RL 1.2.1 - Basic Concepts RL 1.2.2 - Logical view of the documents RL 1.2.3 – The Retrieval process
RL 1.3	RL 1.3.1 -- Classical IR models RL 1.3.2 - Boolean model RL 1.3.3 - Vector space model RL 1.3.4 – Probabilistic model
RL 1.4	RL 1.4.1 – Example of an IR problem RL 1.4.2 – Term document Matrix RL 1.4.3 -Boolean Queries, Inverted Index
CS 1.1	CS 1.1.1 - Difference between Information and Data Retrieval, Information need, IR task CS 1.1.2 – Relevance , Motivation of IR CS 1.1.3 – Examples of different IR models
CS 1.2	CS 1.2.1 – Example of Term document Matrix, Building Inverted Index CS 1.2.2 – Processing Boolean queries – related examples
LE1.1
SS1.1	R1- Ch1, Ch2, T1- Ch2
HW1.1	Exercises at the end of T1 - Ch2
QZ1.1

M2: Boolean Retrieval, Term vocabulary and postings list

Type	Description
RL 2.1	RL 2.1.1 – Query Optimization RL 2.1.2 - Term Vocabulary-Preprocessing
RL 2.2	RL 2.2.1 – Vocabulary of terms- Tokenization, Stop Words RL 2.2.2 – Vocabulary of terms- Normalization, Stemming and Lemmatization
CS 2.1	CS 2.1.1 – Intersection of postings list and query optimization CS 2.1.2 – Preprocessing- document delineation
CS 2.2	CS 2.2.1 – Tokenization CS 2.2.2 – Stop words, Normalization CS 2.2.3 – Stemming and Lemmatization
LE 2.1
SS 2.1	T1 - Ch2
HW 2.1,	Exercises at the end of T1- Ch2
QZ 2.1

M3: Postings list, Phrase queries, Dictionaries

Type	Description
RL 3.1.	RL 3.1.1 – Faster postings list RL 3.1.2 - Postings lists intersection with skip pointers
RL 3.2	RL 3.2.1 – Positional postings and phrase queries RL 3.2.2 – Handling phrase queries
RL 3.3	RL 3.3.1 – Dictionary data structure RL 3.3.2 – Tolerant retrieval – wildcard queries
CS 3.1	CS 3.1.1 – Faster postings list via skip pointers CS 3.1.2 – Skip pointers example, placing skip pointers
CS 3.2	CS 3.2.1 – Phrase queries – examples CS 3.2.2 – Biword indexes – example, extended biwords, issues CS 3.2.3 – Positional indexes
CS 3.3	CS 3.3.1 - Dictionary data structure – Hash table, Tree CS 3.3.2 – Tolerant Retrieval- Wildcard Queries CS 3.3.3 – Wildcard query processing – permuterm indexes CS 3.3.4 - Wildcard query processing - k-gram indexes
LE3.1
SS3.1	T1 - Ch2, Ch3
HW3.1	Exercises at the end of T1- Ch2, Ch3
QZ3.1

M4: Tolerant Retrieval

Type	Description
RL 4.1	RL 4.1.1 – Spelling correction- isolated word correction RL 4.1.2 - Edit distance (Levenshtein distance – computation)
RL 4.2	RL 4.2.1 – Using edit distance, k-gram index RL 4.2.2 – Context sensitive spell correction RL 4.3.1 – Phonetic correction, Soundex
CS 4.1	CS 4.1.1 – Isolated word correction – standard lexicon, lexicon of indexed corpus CS 4.1.2 – Edit distance- example CS 4.1.3 – Computation of Levenshtein distance
CS 4.2	CS 4.2.1- Context sensitive spell correction CS 4.2.2 - Phonetic correction, Soundex algorithm, computation of Soundex codes
LE4.1
SS4.1	T1- Ch3
HW4.1	Exercises at the end of T1- Ch3
QZ4.1	Compute Edit distance between “side” and “slide”

M5: Index Construction, Vector Space Model

Type	Description
RL 5.1	RL 5.1.1 - Blocked sort-based Indexing RL 5.1.2 - Single‐Pass In‐Memory Indexing
RL 5.2	RL 5.2.1 – Distributed Indexing RL 5.2.1 – Dynamic Indexing
RL 5.3	RL 5.3.1 - Term frequency and weighting RL 5.3.2 - The vector space model for scoring
CS 5.1	CS 5.1.1 - BSBI, BSBI index construction algorithm CS 5.1.2 – SPIMI algorithm, Merits and Demerits of both algorithms
CS 5.2	CS 5.2.1 - Distributed Indexing, Examples of search engines using distributed indexing CS 5.2.2 – Mapreduce architecture for distributed indexing, Parsers, Inverters, Schema for index construction in Mapreduce CS 5.2.3 – Dynamic Indexing, Issues with main and auxiliary indexing, Logarithmic merge, Dynamic Indexing used in search engines
CS 5.3	CS 5.3.1 – Boolean retrieval issues, Ranked retrieval models, query – document matching scores CS 5.3.2 – Bag of words model., Term frequency-tf, document frequency-idf, effect of idf on ranking, tf-idf weighting
LE5.1
SS5.1	T1 - Ch4,Ch6
HW5.1	Exercises at the end of T1 - Ch4, Ch6
QZ5.1

M6: Vector Space Model, Text Classification

Type	Description
RL 6.1	RL 6.1.1 – Documents and Queries as vectors, Similarity measure using angle, cosine RL 6.1.2 – Cosine similarity for length normalized vectors RL 6.1.3 - Computing Cosine scores, tf-idf weighting variants
RL 6.2	RL 6.2.1 – Text classification problem RL 6.2.2 – Classification methods
RL 6.3	RL 6.3.1 – Bayesian method of text classification RL 6.3.2 – Bernoulli Model RL 6.3.3 – Time complexity of NB
CS 6.1	CS 6.1.1 – Difference between binary, count and weight Matrix, representation of documents and queries as vectors, Similarity measure using angle, its disadvantage. CS 6.1.2 – cosine similarity measurement between query and document, advantage of using length normalized vectors, , example of cosine similarity computation CS 6.1.3 – Algorithm for computing cosine scores, Weighting in queries vs. documents- Examples.
CS 6.2	CS 6.2.1 – Introduction to Text classification problem, examples of search engines using text classification, Supervised learning CS 6.2.2 – Classification methods – manual, rule-based, statistical
CS 6.3	CS 6.3.1 – Relevance feedback, Bayes rule, Bayes rule for text classification CS 6.3.2 – NB classifier model, Conditional Independence assumption, Parameter estimation, Positional Independence assumption CS 6.3.3 – The problem with maximum likelihood estimates- Zeros, Add-one smoothing, Naive Bayes -Training, Testing, Examples CS 6.3.4 – Time complexity
LE6.1
SS6.1	T1 – Ch6,Ch13
HW6.1	Exercises at the end of T1 – Ch6, Ch13
QZ6.1

M7: Vector space classification, Evaluation of classification

Type	Description
RL 7.1.	RL 7.1.1 – Vector space representation of classes RL 7.1.2 – Rocchio classification
RL 7.2	RL 7.2.1 – k Nearest Neighbour Classification RL 7.2.2 - Evaluation of classification
CS 7.1	CS 7.1.1 - Vector Space Representation, Contiguity hypothesis, Classes in the Vector Space, Vector Space Classification Methods CS 7.1.2 - Illustration of Rocchio Text Categorization, Definition of centroid, Rocchio algorithm, Example
CS 7.2	CS 7.2.1 – Explanation about k Nearest Neighbour Classification, kNN example CS 7.2.2 – kNN algorithm, impact of parameter k, solved example. CS 7.2.3 - Precision P and recall R,F1 measure, Averaging: Micro vs. Macro
LE1.1
SS1.1	T1- Ch14
HW1.1	Exercises at the end of T1-Ch14
QZ1.1

M8: Text Clustering

Type	Description
RL 8.1	RL 8.1.1 – Clustering concepts RL 8.1.2 – Types of clustering Flat vs. Hierarchical clustering
RL 8.2	RL 8.2.1 – Flat algorithms – K means
RL 8.3	RL 8.3.1 - Hierarchical clustering (HAC) RL 8.3.2 – Cluster similarity RL 8.3.3 – Variants of HAC
CS 8.1	CS 8.1.1 – Classification vs. Clustering, Cluster hypothesis, Applications of Clustering in IR, examples of search engines CS 8.1.2 - Flat vs. Hierarchical clustering, Hard vs. Soft clustering CS 8.1.3 – Basic idea of K-means, RSS, K-means algorithm, Example CS 8.1.4 – seed choice, time complexity and optimality of K-means
CS 8.2	CS 8.2.1 – HAC algorithm, Dendrogram CS 8.2.2 – Single Link and Complete link HAC- examples CS 8.2.3 – Centroid and Group Average – examples
LE1.1
SS1.1	T1-Ch16,Ch17
HW1.1	Exercises at the end of T1-Ch16,Ch17
QZ1.1

M9: Web Search, Web Crawler, Link Analysis

Type	Description
RL 9.1	RL 9.1.1 - Web Search Basics RL 9.1.2 – Web documents RL 9.1.3 – Duplicate detection
RL 9.2	RL 9.1.1 – Web crawler operation RL 9.1.2 – Crawling process RL 9.1.3 – Crawler Architecture
RL 9.3	RL 9.3.1 – Link analysis RL 9.3.2 – Page Rank RL 9.3.3 – Hubs and authorities
CS 9.1	CS 9.1.1 – Ads and Web search, Web IR: Differences from traditional IR User Needs in Web search, Bowtie structure of the web CS 9.1.2 - Web documents, Dynamic content, Multilinguality, spam techniques CS 9.1.3 – Near Duplicate detection example, computation of jaccard for sketches CS 9.1.4 – Size of the web issues
CS 9.2	CS 9.2.1 – Web Crawler Architecture- Mercator Scheme CS 9.2.2 – Distributed Indexes- Web CS 9.2.3 – Page rank algorithm – illustration with an example CS 9.2.4 - Hubs and authorities– illustration with an example
LE1.1
SS1.1	T1 – Ch19, Ch20, Ch21
HW1.1	Exercises at the end of T1 – Ch19, Ch20, Ch21
QZ1.1

M10: Multimedia IR, Cross-Lingual IR

Type	Description
RL 10.1	RL 10.1.1 – Multimedia IR- Introduction, Feature Extraction RL 10.1.2 – Search Techniques
RL 10.2	RL 10.2.1 – Cross-Lingual IR RL 10.2.2 – Query Translation methods
CS 10.1	CS 10.1.1 – Examples of spatial access methods in Multimedia IR CS 10.1.2 – Examples of image, video and audio feature extraction
CS 10.2	CS 10.2.1 – Examples of query translation methods CS 10.2.2 – Examples of parallel corpora and comparable corpora
CS 10.3	CS 10.3.1 – Recommender systems
LE10.1
SS10.1	R3,R4
HW10.1
QZ10.1

Contact hour	Pre-contact hour prep	During Contact hour	Post-contact hour
1	None	CS 1.1
2	RL 1.1, RL 1.2, RL 1.3	CS1.2	HW 1.1
3	RL2.1	CS2.1	HW 2.1
4	RL2.2	CS2.2	HW 2.1
5	RL3.1,3.2	CS3.1,CS3.2	HW 3.1
6	RL3.3	CS3.3	HW3.1
7	RL4.1, RL4.2	CS4.1, CS4.2	HW4.1,QZ4.1
8	RL4.4	CS4.2	LE4.1, SS4.1, HW4.1
9	RL5.1, RL5.2,RL5.3	CS5.1, CS5.2,CS5.3	HW 5.1
10	RL 6.1	CS6.1	HW6.1
11		Review
12	RL6.2, RL6.3	CS6.2,CS6.3	HW 6.1
13	RL7.1,RL7.2	CS7.1	HW 7.1
14	RL8.1, RL8.2	CS8.1	HW 8.1
15	RL8.2,RL8.3	CS8.2	HW8.1
16	RL9.1,9.2	CS9.1,CS9.2	HW9.1
17	RL 9.3	CS9.2	HW9.1
18	RL10.1	CS10.1
19	RL10.2	CS10.2
20	-	CS10.3
21		Review
22		Review

Evaluation Scheme:

Legend: EC = Evaluation Component; AN = After Noon Session; FN = Fore Noon Session

No	Name	Type	Duration	Weight	Day, Date, Session, Time
EC-1	Quiz-I	Online	-	5%	February 1 to 10, 2018
	Quiz-II	Online		5%	March 1 to 10, 2018
	Quiz-III	Online		5%	March 20 to 30, 2018
EC-2	Mid-Semester Test	Closed Book	2 hours	35%	03/03/2018 (FN) 10 AM – 12 Noon
EC-3	Comprehensive Exam	Open Book	3 hours	50%	21/04/2018 (FN) 9 AM – 12 Noon

Syllabus for Mid-Semester Test (Closed Book): Topics in Session Nos. 1 to 11.

Syllabus for Comprehensive Exam (Open Book): All topics (Session Nos. 1 to 22)

Important links and information:

Elearn portal: https://elearn.bits-pilani.ac.in

Students are expected to visit the Elearn portal on a regular basis and stay up to date with the latest announcements and deadlines.

Contact sessions: Students should attend the online lectures as per the schedule provided on the Elearn portal.

Evaluation Guidelines:

1. EC-1 consists of either two Assignments or three Quizzes. Students will attempt them through the course pages on the Elearn portal. Announcements will be made on the portal, in a timely manner.

2. For Closed Book tests: No books or reference material of any kind will be permitted.

3. For Open Book exams: Use of books and any printed / written reference material (filed or bound) is permitted. However, loose sheets of paper will not be allowed. Use of calculators is permitted in all exams. Laptops/Mobiles of any kind are not allowed. Exchange of any material is not allowed.

4. If a student is unable to appear for the Regular Test/Exam due to genuine exigencies, the student should follow the procedure to apply for the Make-Up Test/Exam which will be made available on the Elearn portal. The Make-Up Test/Exam will be conducted only at selected exam centres on the dates to be announced later.

It shall be the responsibility of the individual student to be regular in maintaining the self study schedule as given in the course handout, attend the online lectures, and take all the prescribed evaluation components such as Assignment/Quiz, Mid-Semester Test and Comprehensive Exam according to the evaluation scheme provided in the handout.

↑

Of The Day

Information Retrieval - Course Handout

No comments:

Post a Comment

Popular Posts

Blog Archive