Jun Fu: Reading Note For Week 3

Many design decisions in information retrieval are based on the characteristics of hardware.

Access to data in memory is much faster than access to data on disk.

Compression makes SPIMI even more efficient: (1) compression of terms, (2) compression of postings.

Distributed indexing: (1) Maintain a master machine that directs the indexing job; the master is considered “safe” (assumed not to fail). (2) Break up indexing into sets of parallel tasks. (3) The master machine assigns each task to an idle machine from a pool.

MapReduce: In general, MapReduce breaks a large computing problem into smaller parts by recasting it in terms of manipulation of key-value pairs.
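To make the key-value idea concrete, here is a minimal single-machine sketch of index construction in the MapReduce style. It is only an illustration: the function names `map_phase` and `reduce_phase` and the two toy documents are my own assumptions, and a real system would run the two phases on many machines.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Emit one (term, doc_id) key-value pair per token in a document."""
    return [(term, doc_id) for term in text.lower().split()]

def reduce_phase(pairs):
    """Group the pairs by key (term) and build a sorted postings list for each."""
    grouped = defaultdict(set)
    for term, doc_id in pairs:
        grouped[term].add(doc_id)
    return {term: sorted(doc_ids) for term, doc_ids in grouped.items()}

# Hypothetical toy collection: docID -> text.
docs = {1: "Caesar came", 2: "Caesar conquered"}

# In a real cluster each document split would be mapped on a different machine.
pairs = [pair for doc_id, text in docs.items() for pair in map_phase(doc_id, text)]
index = reduce_phase(pairs)
```

Here the key is the term and the value is the docID, so the reduce step ends up with one postings list per term.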

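Why compression: Use less disk space. Keep more data in memory (increases speed). Increase the speed of data transfer from disk to memory, since reading compressed data and then decompressing it is faster than reading uncompressed data.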

Compression in inverted indexes: (1) First, consider the space for the dictionary. Make it small enough to keep in main memory. (2) Then the postings. Reduce the disk space needed and decrease the time to read from disk. Large search engines keep a significant part of the postings in memory.
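These notes do not name a specific postings compression scheme. As one possible illustration, the sketch below assumes gap encoding, where each docID in a sorted postings list is stored as the difference from the previous one, so that frequent terms produce small numbers that need fewer bits. The function names and the sample docIDs are hypothetical.

```python
def to_gaps(postings):
    """Turn a sorted list of docIDs into gaps; the first entry is kept as-is."""
    gaps = []
    previous = 0
    for doc_id in postings:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps

def from_gaps(gaps):
    """Recover the original docIDs by summing the gaps (nothing is lost)."""
    postings = []
    total = 0
    for gap in gaps:
        total += gap
        postings.append(total)
    return postings

postings = [283042, 283043, 283044, 283045]
gaps = to_gaps(postings)      # one large number followed by small gaps
restored = from_gaps(gaps)    # same list as postings
```

The gaps by themselves save nothing; the saving comes when they are then written with a variable-length code, which is outside this sketch.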

Lossless compression: All information is preserved. Lossy compression: Discard some information.

Several of the preprocessing steps can be viewed as lossy compression: case folding, stop words, stemming, number elimination.
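A small sketch of why these steps are lossy: after they run, the original text cannot be reconstructed from the tokens. The stop word list and the function name `lossy_normalize` are assumptions for illustration, and stemming is left out because it would need a real stemmer.

```python
# Hypothetical, very small stop word list; real lists are much longer.
STOP_WORDS = {"the", "a", "of", "and", "in"}

def lossy_normalize(text):
    """Apply case folding, stop word removal, and number elimination."""
    tokens = text.lower().split()                         # case folding
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop word removal
    tokens = [t for t in tokens if not t.isdigit()]       # number elimination
    return tokens

tokens = lossy_normalize("The Fall of Rome in 476")
```

The capitalization, the stop words, and the number are gone for good, which is what makes this a form of lossy compression: fewer distinct terms and fewer postings to store.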

Zipf’s law: The i-th most frequent term has a collection frequency proportional to 1/i.
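As a quick worked example of the 1/i relationship, the sketch below computes the frequencies Zipf's law would predict for the top few ranks. The starting count of 1,000,000 for the most frequent term is a made-up number, and `zipf_frequencies` is a hypothetical helper.

```python
def zipf_frequencies(top_frequency, ranks):
    """Predicted frequency of the i-th most frequent term: top_frequency / i."""
    return [top_frequency / i for i in range(1, ranks + 1)]

# Assumed count for the most frequent term, chosen only for illustration.
freqs = zipf_frequencies(1_000_000, 4)
```

So under the law the second most frequent term occurs half as often as the first, the third a third as often, and so on.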


Friday, January 23, 2015
