Thursday, January 29, 2015

Reading Notes For Week 4

Chapters 1.3 and 1.4

Postings list intersection is a crucial operation in processing boolean queries.
Standard postings list intersection operations remain necessary when both terms of a query are very common.

If the lengths of the postings lists are x and y, the intersection takes O(x + y) operations, so the time complexity of this step is a major concern.
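
A minimal sketch of that merge in Python, assuming each postings list is already a sorted list of docIDs (names are illustrative):

    def intersect(p1, p2):
        # Walk both sorted docID lists in lockstep: O(x + y) steps total.
        answer = []
        i = j = 0
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                answer.append(p1[i])
                i += 1
                j += 1
            elif p1[i] < p2[j]:
                i += 1
            else:
                j += 1
        return answer

    print(intersect([1, 2, 4, 11, 31], [2, 31, 54]))   # [2, 31]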

Query optimization is the process of selecting how to organize the work of answering a query so that the least total amount of work needs to be done by the system.
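
Continuing the sketch above: for a conjunctive query over several terms, the chapter's heuristic is to process terms in order of increasing postings-list length, so intermediate results stay as small as possible (term_postings is an assumed dict from term to sorted docID list):

    def intersect_query(terms, term_postings):
        # Rarest term first keeps every intermediate result small.
        ordered = sorted(terms, key=lambda t: len(term_postings[t]))
        result = term_postings[ordered[0]]
        for term in ordered[1:]:
            result = intersect(result, term_postings[term])
            if not result:        # early exit on an empty intersection
                break
        return result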

A proximity operator is a way of specifying that two terms in a query must occur close to each other in a document, where closeness may be measured by limiting the allowed number of intervening words or by reference to a structural unit such as a sentence or paragraph.

There is a tradeoff between the AND and OR operators. AND tends to produce high-precision but low-recall searches, while OR gives low-precision but high-recall searches, and it is difficult or impossible to find a satisfactory middle ground.
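
For reference, the standard definitions behind this tradeoff (textbook evaluation measures, not stated in this passage): with R the set of relevant documents and A the set of retrieved documents,

    \text{precision} = \frac{|R \cap A|}{|A|}, \qquad
    \text{recall} = \frac{|R \cap A|}{|R|}

A strict AND query keeps A small, so precision tends to be high while recall suffers; a broad OR query inflates A, raising recall at the cost of precision.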

Chapter 6

Boolean queries: A document either matches or does not match a query.
It is essential for a search engine to rank-order the documents matching a query.

Metadata includes fields such as the date of creation and the format of the document, as well as the author and possibly the title. Zones are similar to fields, except that the contents of a zone can be arbitrary free text.

The representation of a set of documents as vectors in a common vector space is known as the vector space model and is fundamental to a host of information retrieval (IR) operations including scoring documents on a query, document classification, and document clustering.

Vector Space Ranking
(1) Represent the query as a weighted tf-idf vector.
(2) Represent each document as a weighted tf-idf vector.
(3) Compute the cosine similarity score for the query vector and each document vector.
(4) Rank documents with respect to the query by score.
(5) Return the top K (e.g., K = 10) to the user.
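
A compact end-to-end sketch of these five steps over a toy in-memory corpus; the weighting is one common tf-idf variant (logarithmic tf, idf, cosine normalization), and all names and data are illustrative:

    import math
    from collections import Counter

    docs = {1: "new york times", 2: "new york post", 3: "los angeles times"}
    N = len(docs)

    # Document frequency of each term.
    df = Counter(t for text in docs.values() for t in set(text.split()))

    def tfidf_vector(text):
        # Steps (1)/(2): (1 + log10 tf) * log10(N / df), cosine-normalized.
        tf = Counter(text.split())
        vec = {t: (1 + math.log10(c)) * math.log10(N / df[t])
               for t, c in tf.items() if t in df and df[t] < N}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        return {t: w / norm for t, w in vec.items()}

    doc_vecs = {d: tfidf_vector(text) for d, text in docs.items()}

    def rank(query, k=10):
        # Steps (3)-(5): cosine scores, sort, return the top K.
        qvec = tfidf_vector(query)
        scores = {d: sum(w * dvec.get(t, 0.0) for t, w in qvec.items())
                  for d, dvec in doc_vecs.items()}
        return sorted(scores.items(), key=lambda x: -x[1])[:k]

    print(rank("new times"))   # doc 1 scores highest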

Muddiest Points For Week 4

How do we compute the weights in a query language? For example, in Queries from TREC Topics: #wsum(1 4.0 #UW3(identity theft) 4.0 identity 4.0 theft 2.0 safe 2.0 information 2.0 Internet). Are the weights associated with the title, the description, and the content?

Can we use a B+ tree instead of an ordinary B-tree as the access method?

Friday, January 23, 2015

Reading Notes For Week 3

Many design decisions in information retrieval are based on the characteristics of hardware.

Access to data in memory is much faster than access to data on disk.

Compression makes SPIMI even more efficient: (1) compression of terms; (2) compression of postings.

Distributed indexing: (1) Maintain a master machine directing the indexing job (considered “safe”). (2) Break the indexing up into sets of (parallel) tasks. (3) The master machine assigns each task to an idle machine from a pool.

Map-Reduce: In general, MapReduce breaks a large computing problem into smaller parts by recasting it in terms of manipulation of key-value pairs.
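
A toy single-process illustration of that recasting: the map phase emits a (term, docID) pair per token, and the reduce phase groups equal keys into postings lists. Real MapReduce distributes both phases across machines; everything here is illustrative:

    from itertools import groupby

    docs = {1: "caesar came", 2: "caesar was killed", 3: "brutus came"}

    # Map phase: one (term, docID) pair per token occurrence.
    pairs = [(term, doc_id) for doc_id, text in docs.items()
             for term in text.split()]

    # Shuffle/sort phase: bring equal keys together.
    pairs.sort()

    # Reduce phase: collapse each term's pairs into a sorted postings list.
    index = {term: sorted({d for _, d in group})
             for term, group in groupby(pairs, key=lambda kv: kv[0])}

    print(index["caesar"])   # [1, 2]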

Compression: keeps more data in memory (which increases speed) and increases the effective data-transfer rate from disk to memory, since compressed data takes less time to read.

Compression in inverted indexes: (1) First, consider the space for the dictionary: make it small enough to keep in main memory. (2) Then the postings: reduce the disk space needed and decrease the time to read postings from disk. Large search engines keep a significant part of their postings in memory.
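
One concrete standard technique on the postings side, sketched under the assumption of sorted docIDs: store the gaps between successive docIDs instead of the docIDs themselves, then encode each gap with variable-byte (VB) code so that the small gaps of frequent terms fit in a single byte:

    def vb_encode_number(n):
        # Split n into 7-bit chunks; the high bit marks the final byte.
        chunks = []
        while True:
            chunks.insert(0, n % 128)
            if n < 128:
                break
            n //= 128
        chunks[-1] += 128   # set the continuation-stop bit
        return bytes(chunks)

    def vb_encode_postings(postings):
        # Encode gaps between consecutive docIDs in a sorted list.
        out, prev = b"", 0
        for doc_id in postings:
            out += vb_encode_number(doc_id - prev)
            prev = doc_id
        return out

    print(list(vb_encode_postings([824, 829, 215406])))
    # [6, 184, 133, 13, 12, 177]: gaps 824, 5, 214577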

Lossless compression: All information is preserved. Lossy compression: Discard some information.

Several of the preprocessing steps can be viewed as lossy compression: case folding, stop words, stemming, number elimination.

Zipf’s law: the ith most frequent term has frequency proportional to 1/i.
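
Written out (the constant c depends on the collection):

    \mathrm{cf}_i \propto \frac{1}{i}
    \quad\Longrightarrow\quad
    \log \mathrm{cf}_i = \log c - \log i

so the second most frequent term occurs about half as often as the first, and a log-log plot of collection frequency against rank is roughly a straight line with slope -1.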

Muddiest Points For Week 2

Stemming never lowers recall (coverage). At this point, what does recall really mean?

How would incorrect sense assignments hurt recall?

Friday, January 9, 2015

Reading Notes For Week 2

Tokens and normalized tokens can be seen as loosely equivalent to words.

The first step of processing is to convert the byte sequence (the digital documents in a file or on a web server) into a linear sequence of characters. We need to determine the correct encoding of each document, such as UTF-8 or ASCII, as well as the document format.

Then we need to determine the document unit for indexing. For very long documents, the issue of indexing granularity arises, and choosing a granularity is a precision/recall tradeoff.

Given a character sequence, we need to chop it into pieces called tokens.

We may drop common terms, the stop words, such as "a" and "an". But the meaning of a phrase may change if common terms are dropped.
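
A minimal sketch of both steps, with a tiny illustrative stop list (real stop lists are longer):

    import re

    STOP_WORDS = {"a", "an", "and", "the", "to", "be", "or", "not"}

    def tokenize(text):
        # Chop the character sequence into lowercase word tokens.
        return re.findall(r"[a-z0-9]+", text.lower())

    tokens = tokenize("To be, or not to be")
    print(tokens)
    print([t for t in tokens if t not in STOP_WORDS])   # []: the phrase vanishes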

Token normalization is the process of canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens. The most standard way to normalize is to create equivalence classes.

The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.

For phrase queries, one approach is to consider every pair of consecutive terms in a document as a phrase (a biword index). The standard solution, however, is a positional index. The strategies of biword indexes and positional indexes can be fruitfully combined.
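
A minimal sketch of a two-term phrase lookup against a positional index, assuming an in-memory dict from term to {docID: sorted positions} (all data illustrative):

    # term -> {docID: sorted token positions within that document}
    pos_index = {
        "new":  {1: [0, 5], 2: [2]},
        "york": {1: [1], 2: [7]},
    }

    def phrase_docs(t1, t2, index):
        # A doc matches "t1 t2" iff some position of t2 equals a position of t1 plus one.
        p1, p2 = index.get(t1, {}), index.get(t2, {})
        hits = []
        for doc_id in p1.keys() & p2.keys():
            positions2 = set(p2[doc_id])
            if any(p + 1 in positions2 for p in p1[doc_id]):
                hits.append(doc_id)
        return sorted(hits)

    print(phrase_docs("new", "york", pos_index))   # [1]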

There are two main search data structures for dictionaries: hashing and search trees.
With hashing, there is no easy way to find minor variants of a query term, because they could be hashed to very different integers. Search trees overcome many of these issues. However, a binary search tree must be kept balanced, and rebalancing is costly if the tree changes frequently (for example, under insertion and deletion). This motivates the B-tree, a balanced tree in which each internal node may have a variable number of children.

A query such as mon* is known as a trailing wildcard query, because the * symbol occurs only once, at the end of the search string.
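
With the dictionary kept in sorted order (as in a search tree or a sorted array), mon* becomes a range lookup over [mon, moo). A sketch of that idea with binary search and an illustrative toy dictionary:

    import bisect

    dictionary = sorted(["moat", "mohair", "monday", "money", "monkey", "moon"])

    def trailing_wildcard(prefix, terms):
        # All terms t with prefix <= t < prefix with its last character bumped.
        lo = bisect.bisect_left(terms, prefix)
        upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)   # "mon" -> "moo"
        hi = bisect.bisect_left(terms, upper)
        return terms[lo:hi]

    print(trailing_wildcard("mon", dictionary))   # ['monday', 'money', 'monkey']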

The final technique for tolerant retrieval has to do with phonetic correction: misspellings that arise because the user types a query that sounds like the target term. The main idea is to generate, for each term, a "phonetic hash" so that similar-sounding terms hash to the same value.
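
A minimal sketch of the classic Soundex scheme, one widely used phonetic hash (variants differ in details; this follows the common four-character form, first letter plus three digits):

    def soundex(term):
        # Map letters to digit classes; vowel-like letters map to "0".
        codes = {**dict.fromkeys("aeiouhwy", "0"),
                 **dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
                 **dict.fromkeys("dt", "3"), "l": "4",
                 **dict.fromkeys("mn", "5"), "r": "6"}
        term = term.lower()
        digits = [codes.get(c, "0") for c in term]
        # Collapse runs of the same digit, then drop the zeros.
        collapsed = [d for i, d in enumerate(digits) if i == 0 or d != digits[i - 1]]
        body = [d for d in collapsed[1:] if d != "0"]
        return (term[0].upper() + "".join(body) + "000")[:4]

    print(soundex("herman"))    # H655
    print(soundex("hermann"))   # H655: similar-sounding terms collide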


Muddiest Points For Week 1

I really did not understand the information retrieval cycle, especially the feedback at each step. I want to know why predict, nominate, and choose divide this cycle into three parts.

Is information retrieval an application of natural language processing? If so, what are the connections between them?

Thursday, January 8, 2015

Reading Notes For Week 1

Information retrieval (IR) is concerned with representing, searching, and manipulating large collections of electronic text and other human-language data.

The rapid development of the World Wide Web has led to the widespread use of IR. IR originated in the development of libraries, and it is now used extensively on the Web, for example in standard search engines such as Google and Bing.

In terms of research, IR can be studied from two distinct points of view: a computer-centered one and a human-centered one.

Indexes are at the core of every modern information retrieval system. The key goal of IR system is to retrieve information that is useful or relevant to the user.

There are two principal aspects to measuring IR system performance: efficiency and effectiveness. Efficiency can be evaluated in time (e.g., seconds per query) and space (e.g., bytes per document). Effectiveness is more difficult to measure than efficiency because it depends entirely on human judgment, which leads to the notion of relevance: a document is considered relevant to a given query if its contents (completely or partially) satisfy the information need represented by the query.

The user's information need may shift. For example, once the user checks the top-ranked document and learns its relevant contents, he or she may seek novel information when clicking the second-ranked document.

The ranking results may differ for the same query: users' query histories and locations can influence the results. Ranking performance can also be improved using hyperlinks and user clicks.

Security and privacy are two practical issues on the web.

IR from a cognitive view is about finding out about (FOA). The FOA process of browsing readers can be imagined to involve three phases: (1) asking a question; (2) constructing an answer; (3) assessing the answer.


A paradoxical feature of FOA problems is that if users knew their question precisely, they might not even need the search engine we are designing; forming a clearly posed question is often the hardest part of answering it. Assessing the answer is the “closing of the loop” between asker and answerer, whereby the user (asker) provides an assessment of just how relevant they find the answer provided.