Friday, February 20, 2015

Reading Notes For Week 7

The idea of relevance feedback (RF) is to involve the user in the IR process so as to improve the final result set.

The Rocchio algorithm is the classic algorithm for implementing relevance feedback. It models a way of incorporating relevance feedback information into the vector space model: the query vector is moved toward the centroid of the documents marked relevant and away from the centroid of those marked nonrelevant. In practice we do not know the truly relevant documents, so the user's feedback judgments stand in for them.
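A minimal Python sketch of the update (alpha, beta, gamma are commonly suggested weights, and the vectors are assumed to be equal-length term-weight arrays):

    import numpy as np

    def rocchio(q, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
        # Move the query toward the centroid of the relevant documents
        # and away from the centroid of the nonrelevant ones.
        q_new = alpha * np.asarray(q, dtype=float)
        if len(relevant) > 0:
            q_new += beta * np.mean(relevant, axis=0)
        if len(nonrelevant) > 0:
            q_new -= gamma * np.mean(nonrelevant, axis=0)
        # Negative term weights are usually clipped to zero.
        return np.maximum(q_new, 0.0)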

The success of RF depends on certain assumptions. First, the user has to have sufficient knowledge to be able to make an initial query that is at least somewhere close to the documents they desire. Second, the RF approach requires relevant documents to be similar to each other.

Pseudo relevance feedback (blind relevance feedback): It automates the manual part of RF, so that the user gets improved retrieval performance without an extended interaction. The method is to do normal retrieval to find an initial set of most relevant documents, to then assume that the top k ranked documents are relevant, and finally to do RF as before under this assumption.
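A rough sketch of that loop, assuming a hypothetical search(q, k) function that returns the top-k document vectors, and reusing the rocchio update above:

    def pseudo_relevance_feedback(q, search, k=10):
        # Blind RF: pretend the top-k results of the initial query are
        # relevant, expand the query, and retrieve again.
        top_k = search(q, k)
        q_expanded = rocchio(q, relevant=top_k, nonrelevant=[])
        return search(q_expanded, k)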

In query expansion, on the other hand, users give additional input on query words or phrases, possibly suggesting additional query terms.

Methods for building a thesaurus for query expansion:
(1) Use of a controlled vocabulary that is maintained by human editors. Here, there is a canonical term for each concept.
(2) A manual thesaurus. Here, human editors have built up sets of synonymous names for concepts, without designating a canonical term.
(3) An automatically derived thesaurus.
(4) Query reformulations based on query log mining.

Query expansion is often effective in increasing recall. However, there is a high cost to manually producing a thesaurus and then updating it for scientific and terminological developments within a field.

Muddiest Points For Week 7

(1) How can we compute λ in linear interpolation smoothing?
      p(w|D) = λ p(w|M_D) + (1 − λ) p(w|M_C)
      Can we use cross-validation to compute λ?
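Held-out likelihood, a simple form of cross-validation, is indeed a standard way to set λ: sweep candidate values and keep the one that maximizes the probability of text the model has not seen. A sketch with assumed unigram count dictionaries (all names illustrative):

    import math

    def interp_prob(w, doc_counts, doc_len, coll_counts, coll_len, lam):
        # p(w|D) = lam * p(w|M_D) + (1 - lam) * p(w|M_C)
        p_doc = doc_counts.get(w, 0) / doc_len
        p_coll = coll_counts.get(w, 0) / coll_len
        return lam * p_doc + (1 - lam) * p_coll

    def heldout_loglik(heldout_words, doc_counts, doc_len,
                       coll_counts, coll_len, lam):
        # Log-likelihood of held-out text under a given lam; choose the
        # candidate lam that maximizes this value.
        return sum(math.log(interp_prob(w, doc_counts, doc_len,
                                        coll_counts, coll_len, lam))
                   for w in heldout_words)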

(2) Is Dirichlet prior smoothing a special type of maximum a posteriori (MAP) estimate that uses prior knowledge?

Tuesday, February 10, 2015

Reading Notes For Week 6

The standard approach to IR system evaluation revolves around the notion of relevant and nonrelevant documents.

Evaluation of unranked retrieval sets: (1) The two most frequent and basic measures for information retrieval effectiveness are precision and recall (see the sketch after this list).

(2) The measures of precision and recall concentrate the evaluation on the return of true positives, asking what percentage of the relevant documents have been found and how many false positives have also been returned.

(3) The advantage of having the two numbers for precision and recall is that one is more important than the other in many circumstances.
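A minimal sketch of these set-based measures, with documents represented as Python sets of ids (names illustrative):

    def precision_recall_f1(retrieved, relevant):
        # retrieved and relevant are sets of document ids.
        tp = len(retrieved & relevant)                  # true positives
        precision = tp / len(retrieved) if retrieved else 0.0
        recall = tp / len(relevant) if relevant else 0.0
        if precision + recall == 0:
            return precision, recall, 0.0
        f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
        return precision, recall, f1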

Evaluation of ranked retrieval results: (1) Precision, recall, and the F measure are set-based measures. They are computed using unordered sets of documents.
(2) In a ranked retrieval context, appropriate sets of retrieved documents are naturally given by the top k retrieved documents.
(3) An ROC curve plots the true positive rate or sensitivity against the false-positive rate or (1 − specificity); a sketch computing these rates follows the next two points.

(1) Sensitivity is just another term for recall. The false-positive rate is given by fp/(fp + tn).
(2) Specificity, given by tn/(fp + tn), was not seen as a very useful notion. Because the set of true negatives is always so large, its value would be almost 1 for all information needs.
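For concreteness, a sketch computing precision at k for a ranked list and one (FPR, TPR) point of an ROC curve (assumed inputs: a ranked list of doc ids, the relevant set, and the collection size):

    def precision_at_k(ranking, relevant, k):
        # Precision over the top-k documents of a ranked list.
        return sum(1 for d in ranking[:k] if d in relevant) / k

    def roc_point(retrieved, relevant, num_docs):
        # One ROC point: (false-positive rate, true-positive rate).
        tp = len(retrieved & relevant)
        fp = len(retrieved - relevant)
        fn = len(relevant - retrieved)
        tn = num_docs - tp - fp - fn
        tpr = tp / (tp + fn) if (tp + fn) else 0.0  # sensitivity = recall
        fpr = fp / (fp + tn) if (fp + tn) else 0.0
        return fpr, tpr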


The success of an IR system depends on how good it is at satisfying the needs of these idiosyncratic humans, one information need at a time.

Marginal relevance is a better measure of utility for the user.

Evaluation at large search engines: non-relevance-based measures
(1) Clickthrough on first result
(2) Studies of user behavior in the lab
(3) A/B testing

Monday, February 9, 2015

Muddiest Points For Week 6

The assumption of the unigram language model looks like a conditional independence assumption. Can we view this model as a special case of Naive Bayes?

For the estimation of a language model, can we see the estimation process as a machine learning application? In this case, we design a model with its assumptions, and the data is the text. Then we use the data to train the model and predict on future text. The more data we have, the better the predictive accuracy on future data.


Friday, February 6, 2015

Reading Notes For Week 5

Probability ranking principle: Using a probabilistic model, the obvious order in which to present documents to the user is to rank documents by their estimated probability of relevance with respect to the information need: P(R = 1|d, q).

Bayes optimal decision rule: If a set of retrieval results is to be returned, rather than an ordering, the Bayes optimal decision rule, the decision that minimizes the risk of loss, is to simply return documents that are more likely relevant than nonrelevant:
d is relevant iff P(R = 1|d, q) > P(R = 0|d, q).

In machine learning and probability, the Bayes decision rule minimizes the expected error over the classes of the response variable; in this case, the classes are R = 1 and R = 0.

Binary Independence Model: Documents and queries are both represented as binary term incidence vectors. A document d is represented by the vector x⃗ = (x1, ..., xM) where xt = 1 if term t is present in document d and xt = 0 if t is not present in d.

The Naive Bayes assumption is very important in the modeling process. It is the conditional independence assumption that the presence or absence of a word in a document is independent of the presence or absence of any other word. Under this assumption, the computation is simple and intuitive: instead of estimating a joint probability over all combinations of terms, we only multiply per-term probabilities.
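To see what the factorization buys, here is a sketch of the BIM log odds score, where independence turns the joint probability of the incidence vector into a sum of per-term log ratios (p[t] and u[t] are per-term estimates assumed to be computed elsewhere):

    import math

    def bim_log_odds(x, p, u):
        # x: binary incidence vector for a document.
        # p[t] = P(x_t = 1 | R = 1), u[t] = P(x_t = 1 | R = 0).
        score = 0.0
        for t, x_t in enumerate(x):
            if x_t:
                score += math.log(p[t] / u[t])
            else:
                score += math.log((1 - p[t]) / (1 - u[t]))
        return score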

MLE makes the observed data maximally likely. MAP uses prior knowledge about the distribution: we choose the most likely point value for the probabilities based on the prior and the observed evidence.
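A tiny worked contrast, using the standard Bernoulli/Beta conjugate setup (the prior parameters here are illustrative):

    def mle_bernoulli(heads, n):
        # MLE: the parameter value that makes the observed data most likely.
        return heads / n

    def map_bernoulli(heads, n, a=2, b=2):
        # MAP with a Beta(a, b) prior: the posterior mode, which pulls
        # the estimate toward the prior when there is little data.
        return (heads + a - 1) / (n + a + b - 2)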

Bayesian networks: a form of probabilistic graphical model.

Generative model: A traditional generative model of a language, of the kind familiar from formal language theory, can be used either to recognize or to generate strings.

Language model: a function that puts a probability measure over strings drawn from some vocabulary.

Query likelihood model: rank documents by P(d|q), where the probability of a document is interpreted as the likelihood that it is relevant to the query. By Bayes' rule, with a uniform prior over documents, this amounts to ranking by the query likelihood P(q|M_d).
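A sketch of query-likelihood scoring under a smoothed unigram model, using the same linear interpolation as in the Week 7 notes above (counts are assumed precomputed dictionaries, and every query word is assumed to occur somewhere in the collection so the log is defined):

    import math

    def query_likelihood_score(query_words, doc_counts, doc_len,
                               coll_counts, coll_len, lam=0.5):
        # Rank documents by log P(q | M_d), with
        # p(w|d) = lam * p(w|M_d) + (1 - lam) * p(w|M_C).
        score = 0.0
        for w in query_words:
            p_doc = doc_counts.get(w, 0) / doc_len
            p_coll = coll_counts.get(w, 0) / coll_len
            score += math.log(lam * p_doc + (1 - lam) * p_coll)
        return score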

A translation model lets you generate query words not in a document by translation to alternate terms with similar meaning. This also provides a basis for performing cross-language IR.

Muddiest Points For Week 5

Proximity operators: is this about abbreviations and other forms used to represent a whole expression, like #od2(information retrieval)?

The difference between document frequency and collection frequency:
The document frequency is the number of documents in a collection that contain a certain term.
The collection frequency is the total number of occurrences of the term in the collection. In this case, does it mean several collections or just one collection?
Are these correct?