Thursday, March 12, 2015

Reading Notes For Week 9

The role of the user interface is to help searchers:
(1) understand and express their information need
(2) formulate their queries
(3) select among available information sources
(4) understand search results
(5) keep track of the progress of their search

User interaction with search interfaces differs depending on
(1) the type of task
(2) the domain expertise of the information seeker
(3) the amount of time and effort available to invest in the process

User interaction with search interfaces can be divided into two distinct parts:
(1) information lookup
(2) exploratory search, including learning and investigating tasks

Classic Model VS. Dynamic Model
Classic notion of the information seeking process:
(1) problem identification
(2) articulation of information need(s)
(3) query formulation
(4) results evaluation

More recent models emphasize the dynamic nature of the search process:
(1) The users learn as they search
(2) Their information needs adjust as they see retrieval results and other document surrogates

The primary methods for searchers to express their information need are:
(1) entering words into a search entry form
(2) selecting links from a directory or other information organization display

The document surrogate refers to the information that summarizes the document.
(1) This information is a key part of the success of the search interface
(2) The design of document surrogates is an active area of research and experimentation
(3) The quality of the surrogate can greatly affect the perceived relevance of the search results listing

Tools to help users reformulate their query:
(1) One technique consists of showing terms related to the query or to the documents retrieved in response to the query
(2) A special case of this is spelling corrections or suggestions (see the sketch after this list)
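As a rough illustration of the spelling-suggestion case, the small sketch below uses Python's difflib to propose corrections against a hypothetical index vocabulary; the vocabulary and the misspelled query term are made up for the example.

import difflib

# Hypothetical vocabulary drawn from the indexed collection.
vocabulary = ["retrieval", "relevance", "feedback", "precision", "recall", "ranking"]

def suggest_spelling(term, vocab, max_suggestions=3):
    """Return vocabulary terms that are close matches to a (possibly misspelled) query term."""
    return difflib.get_close_matches(term, vocab, n=max_suggestions, cutoff=0.7)

print(suggest_spelling("retreival", vocabulary))  # e.g. ['retrieval']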

Organizing Search Results:
(1) Category system: meaningful labels organized in such a way as to reflect the concepts relevant to a domain
(2) Clustering refers to the grouping of items according to some measure of similarity
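A minimal sketch of the clustering idea, assuming scikit-learn is available: group a handful of invented result snippets by tf-idf similarity with k-means. The snippets and the cluster count are placeholders for illustration, not a prescribed setup.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Made-up result snippets standing in for retrieved documents.
snippets = [
    "jaguar speed car engine",
    "jaguar habitat rainforest animal",
    "car engine horsepower speed",
    "rainforest animal conservation",
]

# Represent each snippet as a tf-idf vector, then group by similarity.
vectors = TfidfVectorizer().fit_transform(snippets)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

for snippet, label in zip(snippets, labels):
    print(label, snippet)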

Visualizing Query Terms
(1) Understanding the role of the query terms within the retrieved docs can help relevance assessment
(2) In the TileBars interface, for instance, documents are shown as horizontal glyphs
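To make the TileBars idea concrete, here is a toy sketch (not the original TileBars implementation): it splits a document into fixed-size segments and counts query-term hits per segment, which is the grid of values a TileBars-style glyph would shade. The document and query are invented.

def term_hit_grid(document, query_terms, segment_size=50):
    """For each query term, count occurrences in each fixed-size segment of the document."""
    tokens = document.lower().split()
    segments = [tokens[i:i + segment_size] for i in range(0, len(tokens), segment_size)]
    return {term: [seg.count(term) for seg in segments] for term in query_terms}

# Example with a made-up document and query.
doc = "query terms appear here and there " * 20
print(term_hit_grid(doc, ["query", "terms"], segment_size=10))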

Words and Docs Relationships
(1) Numerous works proposed variations on the idea of placing words and docs on a two-dimensional canvas
(2) Another idea is to map docs or words from a very high-dimensional term space down into a 2D plane
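A rough sketch of the second idea, assuming a term-document matrix has already been built: a truncated SVD (as in latent semantic analysis) projects each document from the high-dimensional term space down to two coordinates that could be plotted. The matrix here is random placeholder data.

import numpy as np

rng = np.random.default_rng(0)
# Placeholder term-document matrix: 200 terms x 30 documents.
term_doc = rng.random((200, 30))

# SVD of the matrix; the top two singular directions give a 2D layout of the documents.
U, S, Vt = np.linalg.svd(term_doc, full_matrices=False)
doc_coords = (np.diag(S[:2]) @ Vt[:2, :]).T   # one (x, y) point per document

print(doc_coords.shape)  # (30, 2)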

Muddiest Points For Week 7

How do we use interpolation to describe the relationship between precision and recall on a graph?

Can we apply relevance feedback within the language modeling framework?

Friday, February 20, 2015

Reading Notes For Week 7

The idea of relevance feedback (RF) is to involve the user in the IR process so as to improve the final result set.

The Rocchio algorithm is the classic algorithm for implementing relevance feedback. It models a way of incorporating relevance feedback information into the vector space model. However, we do not know the truly relevant docs, so in practice the documents the user has judged stand in for them.
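A minimal sketch of the Rocchio update in the vector space model, using the conventional α, β, γ weights; the weights and toy vectors below are placeholders, and the judged documents play the role of the unknown true relevant set.

import numpy as np

def rocchio(query_vec, relevant_docs, nonrelevant_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio relevance feedback: move the query toward the centroid of judged-relevant
    documents and away from the centroid of judged-nonrelevant documents."""
    rel_centroid = np.mean(relevant_docs, axis=0) if len(relevant_docs) else 0.0
    nonrel_centroid = np.mean(nonrelevant_docs, axis=0) if len(nonrelevant_docs) else 0.0
    new_query = alpha * query_vec + beta * rel_centroid - gamma * nonrel_centroid
    return np.maximum(new_query, 0.0)  # negative term weights are usually clipped to zero

# Toy example with 4-term vectors.
q = np.array([1.0, 0.0, 0.0, 1.0])
rel = np.array([[0.9, 0.1, 0.0, 0.8], [0.8, 0.0, 0.1, 0.9]])
nonrel = np.array([[0.0, 0.9, 0.8, 0.0]])
print(rocchio(q, rel, nonrel))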

The success of RF depends on certain assumptions. First, the user has to have sufficient knowledge to be able to make an initial query that is at least somewhere close to the documents they desire. Second, the RF approach requires relevant documents to be similar to each other.

Pseudo relevance feedback (blind relevance feedback): It automates the manual part of RF, so that the user gets improved retrieval performance without an extended interaction. The method is to do normal retrieval to find an initial set of most relevant documents, to then assume that the top k ranked documents are relevant, and finally to do RF as before under this assumption.
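A sketch of that loop under simple vector space assumptions: score documents by cosine similarity, assume the top k are relevant, apply a Rocchio-style expansion, and re-score. All vectors and the value of k are made up for illustration.

import numpy as np

def cosine_scores(query, docs):
    """Cosine similarity of the query against each document row."""
    q = query / (np.linalg.norm(query) + 1e-9)
    d = docs / (np.linalg.norm(docs, axis=1, keepdims=True) + 1e-9)
    return d @ q

def pseudo_relevance_feedback(query, docs, k=2, alpha=1.0, beta=0.75):
    """Assume the top-k documents of an initial ranking are relevant and expand the query."""
    initial = np.argsort(-cosine_scores(query, docs))
    top_k_centroid = docs[initial[:k]].mean(axis=0)
    expanded = alpha * query + beta * top_k_centroid
    return np.argsort(-cosine_scores(expanded, docs))

docs = np.array([[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.1, 0.9, 0.8]])
print(pseudo_relevance_feedback(np.array([1.0, 0.0, 0.0]), docs))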

In query expansion, on the other hand, users give additional input on query words or phrases, possibly suggesting additional query terms.

Methods for building a thesaurus for query expansion
(1) Use of a controlled vocabulary that is maintained by human editors. Here, there is a canonical term for each concept.
(2) A manual thesaurus. Here, human editors have built up sets of synonymous names for concepts, without designating a canonical term.
(3) An automatically derived thesaurus.
(4) Query reformulations based on query log mining

Query expansion is often effective in increasing recall. However, there is a high cost to manually producing a thesaurus and then updating it for scientific and terminological developments within a field.
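To illustrate expansion with a manual thesaurus (method (2) in the list above), here is a toy sketch; the synonym table and query are invented, not drawn from any real thesaurus.

# Hypothetical hand-built thesaurus: each term maps to a set of synonymous names.
thesaurus = {
    "car": {"automobile", "vehicle"},
    "film": {"movie", "motion picture"},
}

def expand_query(terms, thesaurus):
    """Add thesaurus synonyms for each query term (tends to raise recall, may hurt precision)."""
    expanded = []
    for term in terms:
        expanded.append(term)
        expanded.extend(sorted(thesaurus.get(term, ())))
    return expanded

print(expand_query(["car", "repair"], thesaurus))  # ['car', 'automobile', 'vehicle', 'repair']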

Muddiest Points For Week 7

(1) How can we compute λ in linear interpolation smoothing?
      p(w | D) = λ p(w | M_D) + (1 − λ) p(w | M_C)
      Can we use cross-validation to compute λ? (A small sketch follows after these questions.)

(2) Is Dirichlet prior smoothing a special type of maximum a posteriori (MAP) estimation that uses prior knowledge?
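As a sketch of the linear interpolation formula in question (1), the function below smooths a document's maximum-likelihood estimate with the collection model; λ is set by hand here, though it could in principle be tuned on held-out data, e.g. by cross-validation, as the question suggests. The toy document and collection are invented.

from collections import Counter

def interpolated_prob(word, doc_tokens, collection_tokens, lam=0.5):
    """p(w | D) = lam * p(w | M_D) + (1 - lam) * p(w | M_C)."""
    doc_counts, coll_counts = Counter(doc_tokens), Counter(collection_tokens)
    p_doc = doc_counts[word] / len(doc_tokens)
    p_coll = coll_counts[word] / len(collection_tokens)
    return lam * p_doc + (1 - lam) * p_coll

doc = "the cat sat on the mat".split()
collection = "the cat and the dog sat on the mat near the door".split()
print(interpolated_prob("cat", doc, collection, lam=0.7))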

Tuesday, February 10, 2015

Reading Notes For Week 5

The standard approach to IR system evaluation revolves around the notion of relevant and nonrelevant documents.

Evaluation of unranked retrieval sets: (1) The two most frequent and basic measures for information retrieval effectiveness are precision and recall.

(2) The measures of precision and recall concentrate the evaluation on the return of true positives, asking what percentage of the relevant documents have been found and how many false positives have also been returned.

(3) The advantage of having the two numbers for precision and recall is that one is more important than the other in many circumstances.
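A small sketch of these set-based measures, computed from made-up retrieved and relevant document ID sets:

def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall, and balanced F measure (F1)."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example: 3 of the 5 retrieved documents are among the 4 relevant ones.
print(precision_recall_f1([1, 2, 3, 4, 5], [2, 3, 5, 9]))  # (0.6, 0.75, ~0.667)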

Evaluation of ranked retrieval results: (1) Precision, recall, and the F measure are set-based measures. They are computed using unordered sets of documents.
(2) In a ranked retrieval context, appropriate sets of retrieved documents are naturally given by the top k retrieved documents.
(3) An ROC curve plots the true positive rate or sensitivity against the false-positive rate or (1 − specificity).

(1) Sensitivity is just another term for recall. The false-positive rate is given by fp/(fp + tn).
(2) Specificity, given by tn/(fp + tn), was not seen as a very useful notion. Because the set of true negatives is always so large, its value would be almost 1 for all information needs.
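A sketch tying the ranked view and the ROC quantities together: precision at k over a toy ranked list, plus the true-positive and false-positive rates from made-up contingency counts.

def precision_at_k(ranked_doc_ids, relevant, k):
    """Precision computed over the top-k documents of a ranking."""
    top_k = ranked_doc_ids[:k]
    return sum(1 for d in top_k if d in relevant) / k

def tpr_fpr(tp, fp, tn, fn):
    """True-positive rate (recall/sensitivity) and false-positive rate (1 - specificity)."""
    return tp / (tp + fn), fp / (fp + tn)

ranking = [7, 2, 9, 4, 1]
relevant = {2, 4, 8}
print(precision_at_k(ranking, relevant, k=3))  # 1/3
print(tpr_fpr(tp=30, fp=10, tn=960, fn=20))    # (0.6, ~0.0103)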


The success of an IR system depends on how good it is at satisfying the needs of these idiosyncratic humans, one information need at a time.

Marginal relevance is a better measure of utility for the user.

Evaluation at large search engines: non-relevance-based measures
(1) Clickthrough on first result
(2) Studies of user behavior in the lab
(3) A/B testing

Monday, February 9, 2015

Muddiest Points For Week 6

The assumption of the unigram language model is like conditional independence. Can we view this model as a special case of Naive Bayes?

For the estimation of a language model, can we see the estimation process as a machine learning application? In this case, we design a model with its assumptions and the data, which is the text. Then we use the data to train the model and predict future text. The more data we have, the better the predictive accuracy on future data.