Friday, April 10, 2015

Reading Notes For Week 13

14.1 Parallel query processing
There are many ways in which parallelism can help a search engine process queries faster. The two most popular approaches are index partitioning and replication.

14.1.1 Document Partitioning
In a document-partitioned index, each index server is responsible for a subset of the documents in the collection. Each incoming user query is received by a frontend server, the receptionist, which forwards it to all n index nodes, waits for them to process the query, merges the search results received from the index nodes, and sends the final list of results to the user.
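A minimal sketch of the receptionist's fan-out-and-merge step. The node object, its search() method, and the (score, doc_id) result format are all assumptions here, and a real receptionist would query the nodes in parallel rather than in a loop:

    import heapq

    def receptionist_search(query, index_nodes, k=10):
        # Fan out: each node returns its local top-k as (score, doc_id)
        # pairs sorted by descending score (hypothetical node API).
        per_node = [node.search(query, k) for node in index_nodes]
        # Gather: merge the n sorted lists and keep the global top-k.
        merged = heapq.merge(*per_node, reverse=True)
        return [hit for _, hit in zip(range(k), merged)]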

14.1.2 Term Partitioning
Term partitioning addresses the disk seek problem by splitting the collection into sets of terms instead of sets of documents.

Despite its potential performance advantage over the document-partitioned approach, at least for on-disk indices, term partitioning has several shortcomings that make it difficult to use the method in practice:
(1) Scalability
(2) Load Imbalance
(3) Term-at-a-Time


14.1.3 Hybrid Schemes
Consider a distributed index with n nodes, for a collection of size |C|. Document partitioning becomes inefficient if |C|/n is too small and disk seeks dominate the overall query processing cost. Term partitioning, on the other hand, becomes impractical if |C| is too large, as the time required to process a single query is likely to exceed the users’ patience.

As an alternative to the hybrid term/document partitioning, we may also consider a hybrid of document partitioning and replication.

14.1.4 Redundancy and Fault Tolerance
When operating a large-scale search engine with thousands of users, reliability and fault tolerance tend to be of similar importance as response time and result quality.

14.2 MapReduce
MapReduce is a framework developed at Google that is designed for massively parallel computations (thousands of machines) on very large amounts of data (many terabytes).

14.2.1 The Basic Framework
MapReduce was inspired by the map and reduce functions found in functional programming languages, such as Lisp.

A classic example is a MapReduce that counts the number of occurrences of each term in a given corpus of text. The input values processed by the map function are documents or other pieces of text; the input keys are ignored. The outcome of the MapReduce is a sequence of (t, f_t) tuples, where t is a term and f_t is the number of times t appeared in the input.
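A minimal single-process sketch of this word-count job, with the framework's shuffle phase simulated by a dictionary (real MapReduce distributes the map and reduce calls across machines):

    from collections import defaultdict

    def map_fn(_key, document):
        # map: emit (term, 1) for every term occurrence in the input text.
        for term in document.lower().split():
            yield (term, 1)

    def reduce_fn(term, counts):
        # reduce: sum all counts emitted for the same term.
        yield (term, sum(counts))

    def run_mapreduce(documents):
        # Simulated shuffle phase: group all map outputs by key.
        grouped = defaultdict(list)
        for doc in documents:
            for term, count in map_fn(None, doc):
                grouped[term].append(count)
        # Reduce phase: one call per distinct key.
        return dict(pair for term, counts in grouped.items()
                    for pair in reduce_fn(term, counts))

    # run_mapreduce(["to be or not to be"])
    #   -> {'to': 2, 'be': 2, 'or': 1, 'not': 1}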

14.2.2 Combiners
In many MapReduce jobs, a single map shard may produce a large number of key/value pairs for the same key.
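A combiner is essentially a reduce run locally on each map shard before the shuffle, so that many (term, 1) pairs for the same term leave the shard as a single (term, n) pair. A sketch, in the style of the word-count job above:

    from collections import defaultdict

    def combine(map_output):
        # Pre-aggregate the shard's (term, 1) pairs before they are
        # sent over the network; this cuts traffic for frequent terms.
        local = defaultdict(int)
        for term, count in map_output:
            local[term] += count
        return local.items()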

14.2.3 Secondary Keys
MapReduce supports the concept of secondary keys that can be used to define the order in which values for the same key arrive at the reduce function.

Muddiest Points For Week 12

In text mining, we only have free text data, which are unstructured. How do we clean and preprocess these data? In machine learning algorithms, for both supervised and unsupervised learning, we have to have some features or attributes. How can we use this text as features?

Thursday, April 2, 2015

Reading Note For Week 12

Classification Methods
1. Manual classification
(1) Very accurate when job is done by experts
(2) Consistent when the problem size and team is small
(3) Difficult and expensive to scale
2. Automatic document classification: Hand-coded rule-based systems
3. Supervised learning of a document-label assignment function
Many systems partly rely on machine learning models, such as SVM and kNN.

Naive Bayes Assumption: assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities P(x_i|c_j) (the conditional independence assumption).
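Spelled out for attributes x_1, ..., x_n and class c_j, the assumption reads:

    P(x_1, \ldots, x_n \mid c_j) = \prod_{i=1}^{n} P(x_i \mid c_j)

so each attribute contributes an independent factor and only the individual conditionals need to be estimated.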

Feature Selection
(1) Text collections have a large number of features
(2) May make using a particular classifier feasible
(3) Reduces training time
(4) Can improve generalization (performance)

Evaluating Categorization
(1) Evaluation must be done on test data that are independent of the training data (usually a disjoint set of instances).
(2) Results can vary based on sampling error due to different training and test sets.
(3) Average results over multiple training and test sets (splits of the overall data) for more reliable estimates.

Vector space methods for Text Classification
(1) K Nearest Neighbors
(2) Decision boundaries
(3) Vector space classification using centroids
(4) Decision Trees (briefly)

k Nearest Neighbor Classification
To classify document d into class c:
(1) Define the k-neighborhood N as the k nearest neighbors of d
(2) Count the number i of documents in N that belong to c
(3) Estimate P(c|d) as i/k
(4) Choose as class argmax_c P(c|d) [= majority class]
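A minimal sketch of these steps, assuming documents are already represented as sparse tf-idf dict vectors and using cosine similarity to define "nearest":

    import math
    from collections import Counter

    def cosine(u, v):
        # Cosine similarity between two sparse dict vectors.
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm = (math.sqrt(sum(w * w for w in u.values()))
                * math.sqrt(sum(w * w for w in v.values())))
        return dot / norm if norm else 0.0

    def knn_classify(d, training_set, k=5):
        # training_set: list of (vector, label) pairs. Take the k most
        # similar documents, estimate P(c|d) as i/k, return majority class.
        neighbors = sorted(training_set,
                           key=lambda ex: cosine(d, ex[0]),
                           reverse=True)[:k]
        votes = Counter(label for _, label in neighbors)
        return votes.most_common(1)[0][0]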

Muddiest Points For Week 11

Can we use trigram language models in a retrieval model? To some degree, a trigram language model can provide more contextual information and reduce ambiguity for terms with multiple meanings.

Can you explain Bayesian interpolation and online Bayesian updating once more?

Tuesday, March 24, 2015

Reading Note For Week 11

Content-based recommendation systems: systems that recommend an item to a user based upon a description of the item and a profile of the user’s interests. Content-based recommendation systems analyze item descriptions to identify items that are of particular interest to the user.

Item Representation
(1) Items that can be recommended to the user are often stored in a database table
(2) Unstructured data: free text data
(3) Many domains are best represented by semi-structured data in which there are some attributes with a set of restricted values and some free-text fields.

Two types of information on user profile
(1) A model of the user’s preferences
(2) A history of the user’s interactions with the recommendation system

Creating a model of the user's preferences from the user history is a form of classification learning. The training data of a classification learner are divided into categories, e.g., the binary categories "items the user likes" and "items the user doesn't like."
(1) Decision Trees and Rule Induction
(2) Nearest Neighbor Methods
(3) Relevance Feedback and Rocchio’s Algorithm
(4) Linear Classifiers
(5) Probabilistic Methods and Naïve Bayes


In the modern Web, as the amount of information available causes information overloading, the demand for personalized approaches for information access increases. Personalized systems address the overload problem by building, managing, and representing information customized for individual users.

Collecting information about users
(1) The information collected may be explicitly input by the user or implicitly gathered by a software agent.
(2) Depending on how the information is collected, different data about the users may be extracted.

User Profile Construction
(1) Building Keyword Profiles: keyword-based profiles are initially created by extracting keywords from Web pages collected from some information source, e.g., the user's browsing history or bookmarks.
(2) Building Semantic Network Profiles: semantic network-based profiles are typically built by collecting explicit positive and/or negative feedback from users.
(3) Building Concept Profiles

Muddiest Points For Week 10

Can a multithreaded crawler avoid crawling duplicate information? How can it avoid this happening?

Can you explain the computation of HITS once more, especially the iterative loop in the process?

Friday, March 20, 2015

Reading Notes For Week 10

User Need:
(1) Informational: want to learn about something
(2) Navigational: want to go to that page
(3) Transactional: want to do something
(4) Gray Areas

Search engine optimization (Spam):
(1) Motives: commercial, political, religious, lobbies; promotion funded by an advertising budget
(2) Operators: contractors (Search Engine Optimizers) for lobbies and companies; webmasters; hosting services
(3) Forums

Random searches:
Choose random searches extracted from a local log, or build "random searches"
Advantage: might be a better reflection of the human perception of coverage
Disadvantages:
(1) Samples are correlated with the source of the log
(2) Duplicates
(3) Technical statistical problems (must have non-zero results; ratio average not statistically sound)

Random IP addresses
Advantages:
(1) Clean statistics
(2) Independent of crawling strategies
Disadvantages:
(1) Doesn't deal with duplication
(2) Many hosts might share one IP, or not accept requests
(3) No guarantee all pages are linked to the root page
(4) The power law for # pages/host generates bias towards sites with few pages
(5) Potentially influenced by spamming (multiple IPs for the same server to avoid an IP block)

Random walks:
View the Web as a directed graph and build a random walk on this graph.
Advantages:
(1) "Statistically clean" method, at least in theory!
(2) Could work even for an infinite web (assuming convergence) under certain metrics
Disadvantages:
(1) The list of seeds is a problem
(2) The practical approximation might not be valid
(3) Non-uniform distribution, hence subject to link spamming

Muddiest Points For Week 9

How does scatter/gather work?

Thursday, March 12, 2015

Reading Notes For Week 9

The role of the user interface:
(1) aid in the searchers’ understanding and expression of their information need
(2) formulate their queries
(3) select among available information sources
(4) understand search results
(5) keep track of the progress of their search

User interaction with search interfaces differs depending on
(1) the type of task
(2) the domain expertise of the information seeker
(3) the amount of time and effort available to invest in the process

User interaction with search interfaces can be divided into two distinct parts
(1) information lookup
(2) exploratory search, including learning and investigating tasks

Classic Model vs. Dynamic Model
Classic notion of the information seeking process:
(1) problem identification
(2) articulation of information need(s)
(3) query formulation
(4) results evaluation

More recent Models emphasize the dynamic nature of the search process
(1) The users learn as they search
(2) Their information needs adjust as they see retrieval results and other document surrogates

The primary methods for searchers to express their information need:
(1) entering words into a search entry form
(2) selecting links from a directory or other information organization display

The document surrogate refers to the information that summarizes the document.
(1) This information is a key part of the success of the search interface
(2) The design of document surrogates is an active area of research and experimentation
(3) The quality of the surrogate can greatly affect the perceived relevance of the search results listing

Tools to help users reformulate their query
(1) One technique consists of showing terms related to the query or to the documents retrieved in response to the query
(2) A special case of this is spelling corrections or suggestions

Organizing Search Results:
(1) Category system: meaningful labels organized in such a way as to reflect the concepts relevant to a domain
(2) Clustering refers to the grouping of items according to some measure of similarity

Visualizing Query Terms
(1) Understanding the role of the query terms within the retrieved docs can help relevance assessment
(2) In the TileBars interface, for instance, documents are shown as horizontal glyphs

Words and Docs Relationships
(1) Numerous works proposed variations on the idea of placing words and docs on a two-dimensional canvas
(2) Another idea is to map docs or words from a very high-dimensional term space down into a 2D plane

Muddiest Points For Week 7

How do we use interpolation to describe the relationship between precision and recall on a graph?

Can we talk about relevance feedback in language model?

Friday, February 20, 2015

Reading Notes For Week 7

The idea of relevance feedback (RF) is to involve the user in the IR process so as to improve the final result set.

The Rocchio algorithm is the classic algorithm for implementing relevance feedback. It models a way of incorporating relevance feedback information into the vector space model. However, we do not know the truly relevant docs in advance.
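A sketch of the Rocchio update over dict-based tf-idf vectors; the alpha/beta/gamma values below are commonly cited textbook defaults, not fixed by the algorithm:

    from collections import defaultdict

    def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
        # q_new = alpha*q + beta*centroid(relevant) - gamma*centroid(nonrelevant)
        # Vectors are dicts mapping term -> tf-idf weight.
        new_q = defaultdict(float)
        for t, w in query.items():
            new_q[t] += alpha * w
        for docs, coeff in ((relevant, beta), (nonrelevant, -gamma)):
            for d in docs:
                for t, w in d.items():
                    new_q[t] += coeff * w / len(docs)
        # Negative weights are usually clipped to zero.
        return {t: w for t, w in new_q.items() if w > 0}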

The success of RF depends on certain assumptions. First, the user has to have sufficient knowledge to be able to make an initial query that is at least somewhere close to the documents they desire. Second, the RF approach requires relevant documents to be similar to each other.

Pseudo relevance feedback (blind relevance feedback): It automates the manual part of RF, so that the user gets improved retrieval performance without an extended interaction. The method is to do normal retrieval to find an initial set of most relevant documents, to then assume that the top k ranked documents are relevant, and finally to do RF as before under this assumption.
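A sketch of that loop, assuming a hypothetical search(query_vector, n) function and the rocchio() update sketched above:

    def pseudo_relevance_feedback(query, search, k=10):
        # Blind RF: run the query, assume the top-k results are relevant,
        # expand the query with Rocchio, and run it again.
        initial_hits = search(query, k)            # [(score, doc_vector), ...]
        assumed_relevant = [vec for _, vec in initial_hits]
        expanded = rocchio(query, assumed_relevant, nonrelevant=[])
        return search(expanded, k)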

In query expansion, on the other hand, users give additional input on query words or phrases, possibly suggesting additional query terms.

Methods for building a thesaurus for query expansion
(1) Use of a controlled vocabulary that is maintained by human editors. Here, there is a canonical term for each concept.
(2) A manual thesaurus. Here, human editors have built up sets of synonymous names for concepts, without designating a canonical term.
(3) An automatically derived thesaurus.
(4) Query reformulations based on query log mining

Query expansion is often effective in increasing recall. However, there is a high cost to manually producing a thesaurus and then updating it for scientific and terminological developments within a field.

Muddiest Points For Week 7

(1) How can we compute λ in linear interpolation smoothing?
      p(w|D) = λ p(w|M_D) + (1 − λ) p(w|M_C)
      Can we use cross-validation to compute λ?

(2) Is Dirichlet prior smoothing a special type of maximum a posteriori (MAP) estimation using prior knowledge?

Tuesday, February 10, 2015

Reading Notes For Week 5

The standard approach to IR system evaluation revolves around the notion of relevant and nonrelevant documents.

Evaluation of unranked retrieval sets:
(1) The two most frequent and basic measures for information retrieval effectiveness are precision and recall.

(2) The measures of precision and recall concentrate the evaluation on the return of true positives, asking what percentage of the relevant documents have been found and how many false positives have also been returned.

(3) The advantage of having the two numbers for precision and recall is that one is more important than the other in many circumstances.
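The two measures (plus F1, their harmonic mean) as a small helper over true positive, false positive, and false negative counts:

    def precision_recall_f1(tp, fp, fn):
        # precision = tp/(tp+fp); recall = tp/(tp+fn); F1 = harmonic mean.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        return precision, recall, f1

    # 30 docs retrieved, 10 of them relevant, 10 relevant docs missed:
    # precision_recall_f1(tp=10, fp=20, fn=10) -> (0.333..., 0.5, 0.4)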

Evaluation of ranked retrieval results:
(1) Precision, recall, and the F measure are set-based measures. They are computed using unordered sets of documents.
(2) In a ranked retrieval context, appropriate sets of retrieved documents are naturally given by the top k retrieved documents.
(3) An ROC curve plots the true positive rate, or sensitivity, against the false-positive rate, or (1 − specificity).

(1) Sensitivity is just another term for recall. The false-positive rate is given by fp/(fp + tn).
(2) Specificity, given by tn/(fp + tn), was not seen as a very useful notion: because the set of true negatives is always so large, its value would be almost 1 for all information needs.


The success of an IR system depends on how good it is at satisfying the needs of these idiosyncratic humans, one information need at a time.

Marginal relevance is a better measure of utility for the user.

Evaluation at large search engines: non-relevance-based measures
(1) Clickthrough on the first result
(2) Studies of user behavior in the lab
(3) A/B testing

Monday, February 9, 2015

Muddiest Points For Week 6

The assumption of the unigram language model resembles conditional independence. Can we view this model as a special case of Naive Bayes?

For the estimation of a language model, can we view the estimation process as a machine learning application? In this case, we design a model with its assumptions, and the data are the text. Then we use the data to train the model and predict on future text. The more data we have, the higher the prediction accuracy on future data.


Friday, February 6, 2015

Reading Notes For Week 5

Probability ranking principle: Using a probabilistic model, the obvious order in which to present documents to the user is to rank documents by their estimated probability of relevance with respect to the information need: P(R = 1|d, q).

Bayes optimal decision rule: If a set of retrieval results is to be returned, rather than an ordering, the Bayes optimal decision rule, the decision that minimizes the risk of loss, is to simply return documents that are more likely relevant than nonrelevant:
d is relevant iff P(R = 1|d, q) > P(R = 0|d, q).

In machine learning and probability, the Bayes decision rule minimizes the error across the whole set of values of the response variable; in this case, R = 1 and R = 0.

Binary Independence Model: Documents and queries are both represented as binary term incidence vectors. A document d is represented by the vector x⃗ = (x_1, ..., x_M), where x_t = 1 if term t is present in document d and x_t = 0 if t is not present in d.

The Naive Bayes assumption is very important in the modeling process: the conditional independence assumption says that the presence or absence of a word in a document is independent of the presence or absence of any other word. Under this assumption the computation is simple and intuitive, and we do not need to estimate the full joint probability over terms.

MLE makes the observed data maximally likely. MAP uses prior knowledge about the distribution: we choose the most likely point value for probabilities based on the prior and the observed evidence.

Bayesian networks: a form of probabilistic graphical model.

Generative model: A traditional generative model of a language, of the kind familiar from formal language theory, can be used either to recognize or to generate strings.

Language model: a function that puts a probability measure over strings drawn from some vocabulary.

Query likelihood model: rank documents by P(d|q), where the probability of a document is interpreted as the likelihood that it is relevant to the query.
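A sketch of query-likelihood scoring with linear interpolation (Jelinek-Mercer) smoothing; the tf/cf dictionaries and the fixed λ are assumptions of this toy version:

    import math

    def query_likelihood(query_terms, doc_tf, doc_len, coll_tf, coll_len, lam=0.5):
        # Score log P(q|d) with p(w|d) = lam*tf(w,d)/|d| + (1-lam)*cf(w)/|C|.
        score = 0.0
        for w in query_terms:
            p = (lam * doc_tf.get(w, 0) / doc_len
                 + (1 - lam) * coll_tf.get(w, 0) / coll_len)
            if p == 0:
                return float("-inf")  # term unseen anywhere in the collection
            score += math.log(p)
        return score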

A translation model lets you generate query words not in a document by translation to alternate terms with similar meaning. This also provides a basis for performing cross-language IR.

Muddiest Points For Week 5

Proximity operators: is this about abbreviations and other forms used to represent a whole word, like #od2(information retrieval)?

The difference between document frequency and collection frequency:
Document frequency is the number of documents in a collection that contain a certain term.
Collection frequency is the number of occurrences of t in the collection. In this case, does it mean several collections or just one collection?
Are these correct?

Thursday, January 29, 2015

Reading Notes For Week 4

Chapter 1.3 and 1.4

Posting list intersection is a crucial operation in processing boolean queries.
Standard postings list intersection operations remain necessary when both terms of a query are very common.

If the lengths of the postings lists are x and y, the intersection takes O(x + y) operations. Time complexity is also a big issue in this step.
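The standard linear merge behind that O(x + y) bound, assuming both postings lists are sorted by document ID:

    def intersect(p1, p2):
        # Walk two sorted postings lists in lockstep; each step advances
        # at least one pointer, so the cost is O(x + y).
        answer = []
        i = j = 0
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                answer.append(p1[i])
                i += 1
                j += 1
            elif p1[i] < p2[j]:
                i += 1
            else:
                j += 1
        return answer

    # intersect([1, 3, 7, 9], [2, 3, 9, 12]) -> [3, 9]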

Query optimization is the process of selecting how to organize the work of answering a query so that the least total amount of work needs to be done by the system.

A proximity operator is a way of specifying that two terms in a query must occur close to each other in a document, where closeness may be measured by limiting the allowed number of intervening words or by reference to a structural unit such as a sentence or paragraph.

There is a tradeoff between using the AND operator and the OR operator. AND operators tend to produce high-precision but low-recall searches, while OR operators give low-precision but high-recall searches, and it is difficult or impossible to find a satisfactory middle ground.

Chapter 6

Boolean queries: A document either matches or does not match a query.
It is essential for a search engine to rank-order the documents matching a query.

Metadata would include fields such as the date of creation and the format of the document, as well as the author and possibly the title of the document. Zones are similar to fields, except the contents of a zone can be arbitrary free text.

The representation of a set of documents as vectors in a common vector space is known as the vector space model and is fundamental to a host of information retrieval (IR) operations including scoring documents on a query, document classification, and document clustering.

Vector Space Ranking
(1) Represent the query as a weighted tf-idf vector
(2) Represent each document as a weighted tf-idf vector
(3) Compute the cosine similarity score for the query vector and each document vector
(4) Rank documents with respect to the query by score
(5) Return the top K (e.g., K = 10) to the user
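A compact sketch of those five steps over a toy in-memory collection; the weighting here is one common tf-idf variant (log-tf times idf), not the only choice:

    import math
    from collections import Counter

    def tfidf_vector(text, df, n_docs):
        # Weight each term by (1 + log tf) * log(N / df); skip unknown terms.
        tf = Counter(text.lower().split())
        return {t: (1 + math.log(c)) * math.log(n_docs / df[t])
                for t, c in tf.items() if t in df}

    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    def rank(query, docs, k=10):
        # docs: list of raw document strings; returns top-k (score, doc_id).
        df = Counter(t for d in docs for t in set(d.lower().split()))
        n = len(docs)
        doc_vecs = [tfidf_vector(d, df, n) for d in docs]
        q_vec = tfidf_vector(query, df, n)
        scores = sorted(((cosine(q_vec, dv), i) for i, dv in enumerate(doc_vecs)),
                        reverse=True)
        return scores[:k]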

Muddiest Points For Week 4

How do we compute the weights in a query language? For example, in queries from TREC topics: #wsum(1 4.0 #UW3(identity theft) 4.0 identity 4.0 theft 2.0 safe 2.0 information 2.0 Internet). Are the weights about the title, description, and content?

Can we use a B+ tree instead of an ordinary B-tree as the access method?

Friday, January 23, 2015

Reading Note For Week 3

Many design decisions in information retrieval are based on the characteristics of hardware.

Access to data in memory is much faster than access to data on disk.

Compression makes SPIMI even more efficient: (1) compression of terms; (2) compression of postings.

Distributed indexing: (1) Maintain a master machine directing the indexing job (considered "safe"). (2) Break up indexing into sets of (parallel) tasks. (3) The master machine assigns each task to an idle machine from a pool.

Map-Reduce: In general, MapReduce breaks a large computing problem into smaller parts by recasting it in terms of manipulation of key-value pairs.

Compression: keep more stuff in memory (increases speed) and increase the effective data transfer rate from disk to memory.

Compression in inverted indexes: (1) First, we will consider space for the dictionary: make it small enough to keep in main memory. (2) Then the postings: reduce the disk space needed and decrease the time to read from disk. Large search engines keep a significant part of postings in memory.

Lossless compression: All information is preserved. Lossy compression: Discard some information.

Several of the preprocessing steps can be viewed as lossy compression: case folding, stop words, stemming, number elimination.

Zipf’s law: The ith most frequent term has frequency proportional to 1/i .
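In symbols, with cf_i the collection frequency of the i-th most frequent term and c a corpus-dependent constant:

    cf_i \approx \frac{c}{i}, \qquad \log(cf_i) \approx \log c - \log i

so on a log-log plot the rank-frequency curve is approximately a straight line with slope -1.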

Muddiest Points For Week 2

Stemming never lowers recall (coverage). At this point, what does recall really mean?

How would incorrect sense assignments hurt recall?

Friday, January 9, 2015

Reading Note For Week 2

Tokens and normalized tokens can be seen as loosely equivalent words.

The first step of processing is to convert the byte sequence, the digital documents in a file or on a web server, into a linear sequence of characters. We also need to determine the correct encoding of each document, such as UTF-8 or ASCII, as well as the document format.

Then we need to determine the document unit for indexing. For very long documents, the issue of indexing granularity arises; this is a tradeoff.

Given a character sequence, we need to chop it into pieces called tokens.

We may drop common terms, the stop words, such as "a" and "an". But the meaning of a phrase may change if common terms are dropped.

Token normalization is the process of canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens. The most standard way to normalize is to create equivalence classes.

The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.

For phrase queries, one approach to handling phrases is to consider every pair of consecutive terms in a document as a phrase. However, the standard solution is positional index. The strategies of biword indexes and positional indexes can be fruitfully combined.
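A sketch of the positional check for a two-term phrase query, assuming each term's postings map doc IDs to sorted position lists:

    def phrase_match(pos1, pos2):
        # True if some position of term2 is exactly one past a position of term1.
        p2 = set(pos2)
        return any(p + 1 in p2 for p in pos1)

    def phrase_docs(postings1, postings2):
        # postings: dict mapping doc_id -> sorted list of term positions.
        return [d for d in postings1.keys() & postings2.keys()
                if phrase_match(postings1[d], postings2[d])]

    # phrase_docs({1: [4, 9]}, {1: [5, 20]}) -> [1]   (phrase at positions 4, 5)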

There are two search data structures for dictionaries: hashing and search trees.
With hashing, there is no easy way to find minor variants of a query term, because these could be hashed to very different integers. Search trees overcome many of these issues. However, a binary tree should be balanced, and rebalancing costs a lot if the search tree changes frequently, for example under deletion. Hence the B-tree, a balanced tree.

A query such as mon* is known as a trailing wildcard query, because the * symbol occurs only once, at the end of the search string.

The final technique for tolerant retrieval has to do with phonetic correction: misspellings that arise because the user types a query that sounds like the target term. The main idea here is to generate, for each term, a "phonetic hash" so that similar-sounding terms hash to the same value.
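A simplified sketch of one such hash, Soundex (the full algorithm has extra rules, e.g., for h/w between same-coded consonants, that are skipped here):

    CODES = {c: str(d) for d, letters in
             enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1)
             for c in letters}

    def soundex(term):
        # Map letters to digit codes, collapse adjacent duplicates,
        # drop vowels/h/w/y (code 0), keep the first letter, pad to 4 chars.
        term = term.lower()
        digits = [CODES.get(c, "0") for c in term]
        collapsed = [d for i, d in enumerate(digits)
                     if d != "0" and (i == 0 or d != digits[i - 1])]
        body = collapsed[1:] if digits[0] != "0" else collapsed
        return (term[0].upper() + "".join(body) + "000")[:4]

    # soundex("herman") == soundex("hermann") == "H655"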


Muddiest Points For Week 1

I really did not understand the information retrieval cycle, especially the feedback at each step. I want to know why predict, nominate, and choose divide this cycle into three parts.

Is information retrieval an application of natural language processing? If so, what are the connections between them?

Thursday, January 8, 2015

Reading Note For Week 1

Information retrieval (IR) is concerned with representing, searching, and manipulating large collections of electronic text and other human-language data.

The rapid development of the World Wide Web has led to the widespread use of IR. IR originally grew out of the development of libraries, and now it is largely used on the Web, for example in standard search engines like Google and Bing.

In terms of research, IR can be studied from two distinct points of view: a computer-centered one and a human-centered one.

Indexes are at the core of every modern information retrieval system. The key goal of IR system is to retrieve information that is useful or relevant to the user.

There are two principal aspects to measuring IR system performance: efficiency and effectiveness. Efficiency can be evaluated by time (e.g., seconds per query) and space (e.g., bytes per document). Effectiveness is more difficult to measure than efficiency; it depends entirely on human judgment. Hence the notion of relevance: a document is considered relevant to a given query if its contents (completely or partially) satisfy the information need represented by the query.

The user's information need may shift. For example, once the user checks the top-ranked document and learns its relevant contents, he or she will want some novel information when clicking the second top-ranked document.

The ranking results may differ for the same query. The users' query history and location may influence the results. Ranking performance can also be improved using hyperlinks and user clicks.

Security and privacy are two practical issues on the Web.

IR from a cognitive view is about finding out about (FOA). The FOA process of browsing readers can be imagined to involve three phases: (1) asking a question; (2) constructing an answer; (3) assessing the answer.


A paradoxical feature of the FOA problem is that if users knew their question precisely, they might not even need the search engine we are designing; forming a clearly posed question is often the hardest part of answering it. Assessing the answer is the "closing of the loop" between asker and answerer, whereby the user (asker) provides an assessment of just how relevant they find the answer provided.