Friday, April 10, 2015

Reading Notes For Week 13

14.1 Parallel query processing
There are many ways in which parallelism can help a search engine process queries faster. The two most popular approaches are index partitioning and replication.

14.1.1 Document Partitioning
In a document-partitioned index, each index server is responsible for a subset of the documents in the collection. Each incoming user query is received by a frontend server, the receptionist, which forwards it to all n index nodes, waits for them to process the query, merges the search results received from the index nodes, and sends the final list of results to the user.
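The scatter-gather pattern described above can be sketched as follows. This is a minimal single-process sketch, not a real distributed system; `search_node` and the term-frequency `score` function are toy stand-ins for a node's actual retrieval logic.

```python
import heapq

def score(doc, query):
    # toy relevance score: how often the query terms occur in the document
    terms = doc.split()
    return sum(terms.count(t) for t in query.split())

def search_node(node_docs, query, k):
    # each index node scores the query against its own document subset
    # and returns its local top-k results as (score, doc_id) pairs
    results = [(score(doc, query), doc_id) for doc_id, doc in node_docs.items()]
    return heapq.nlargest(k, results)

def receptionist(nodes, query, k=10):
    # broadcast the query to all n index nodes (the "scatter" step) ...
    partial = [search_node(node, query, k) for node in nodes]
    # ... then merge the per-node top-k lists into a final ranking ("gather")
    merged = heapq.merge(*[sorted(p, reverse=True) for p in partial],
                         reverse=True)
    return list(merged)[:k]
```

Because each node returns at most k results, the receptionist merges n short sorted lists rather than re-scoring anything itself.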

14.1.2 Term Partitioning
Term partitioning addresses the disk seek problem by splitting the collection into sets of terms instead of sets of documents.
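A minimal sketch of the idea, assuming a toy in-memory postings map: each node holds the complete postings lists for a subset of the vocabulary, so a query only touches the nodes that own its terms. The hash-based term-to-node assignment is one simple choice among several.

```python
def node_for_term(term, n):
    # assign each term to exactly one of the n nodes
    return hash(term) % n

def build_nodes(postings, n):
    # distribute the vocabulary: every postings list lives on a single node
    nodes = [dict() for _ in range(n)]
    for term, docs in postings.items():
        nodes[node_for_term(term, n)][term] = set(docs)
    return nodes

def process_query(nodes, terms):
    # fetch each term's full postings list from the node responsible for it,
    # then intersect the lists (conjunctive Boolean retrieval)
    n = len(nodes)
    lists = [nodes[node_for_term(t, n)].get(t, set()) for t in terms]
    return set.intersection(*lists) if lists else set()
```

Note that each postings list is read in its entirety from one node, which is why term partitioning needs fewer disk seeks per query than document partitioning.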

Despite its potential performance advantage over the document-partitioned approach, at least for on-disk indices, term partitioning has several shortcomings that make it difficult to use the method in practice:
(1) Scalability — as the collection grows, each node's postings lists grow with it, so per-query work per node increases even when more nodes are added
(2) Load Imbalance — query terms follow a highly skewed distribution, so nodes holding popular terms become hotspots
(3) Term-at-a-Time — because the postings for different query terms live on different nodes, term partitioning effectively forces term-at-a-time query processing


14.1.3 Hybrid Schemes
Consider a distributed index with n nodes, for a collection of size |C|. Document partitioning becomes inefficient if |C|/n is too small and disk seeks dominate the overall query processing cost. Term partitioning, on the other hand, becomes impractical if |C| is too large, as the time required to process a single query is likely to exceed the users’ patience.

As an alternative to the hybrid term/document partitioning, we may also consider a hybrid of document partitioning and replication.
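A rough sketch of that second hybrid: keep r identical replicas of an n-node document-partitioned index, route each query to one replica, and scatter-gather within it. The `search` callable here is a placeholder for a node's actual query processor.

```python
import random

def route_query(replicas, query, search):
    # pick any replica for this query (simple load balancing across replicas)
    replica = random.choice(replicas)
    # scatter the query across that replica's document-partitioned nodes
    partial = [search(node, query) for node in replica]
    # gather and rank the partial results
    return sorted((hit for p in partial for hit in p), reverse=True)
```

Replication multiplies query throughput roughly by r, while the document partitioning inside each replica keeps per-query latency down.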

14.1.4 Redundancy and Fault Tolerance
When operating a large-scale search engine with thousands of users, reliability and fault tolerance tend to be of similar importance as response time and result quality.

14.2 MapReduce
MapReduce is a framework developed at Google that is designed for massively parallel computations (thousands of machines) on very large amounts of data (many terabytes), and that can accomplish all of the tasks listed above.

14.2.1 The Basic Framework
MapReduce was inspired by the map and reduce functions found in functional programming languages, such as Lisp.

Consider a MapReduce that counts the number of occurrences of each term in a given corpus of text. The input values processed by the map function are documents or other pieces of text; the input keys are ignored. The outcome of the MapReduce is a sequence of (t, f_t) tuples, where t is a term and f_t is the number of times t appears in the input.
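The term-counting job can be sketched with plain Python generators. The `run_mapreduce` driver below is a toy sequential stand-in for the framework itself; only `map_fn` and `reduce_fn` correspond to what a user of MapReduce would actually write.

```python
from collections import defaultdict

def map_fn(key, value):
    # key (e.g. a document id) is ignored; emit (term, 1) per occurrence
    for term in value.split():
        yield term, 1

def reduce_fn(key, values):
    # sum the per-occurrence counts for one term
    yield key, sum(values)

def run_mapreduce(inputs, map_fn, reduce_fn):
    # shuffle phase: group all intermediate values by key
    grouped = defaultdict(list)
    for key, value in inputs:
        for k, v in map_fn(key, value):
            grouped[k].append(v)
    # reduce phase: one reduce call per distinct key
    out = {}
    for k, values in grouped.items():
        for rk, rv in reduce_fn(k, values):
            out[rk] = rv
    return out
```

For example, `run_mapreduce([("d1", "to be or not to be")], map_fn, reduce_fn)` yields the (t, f_t) pairs for that one document.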

14.2.2 Combiners
In many MapReduce jobs, a single map shard may produce a large number of key/value pairs for the same key.
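For the term-counting job, a combiner can pre-aggregate counts inside each map shard, so the shard emits one (term, count) pair per distinct term rather than one pair per occurrence. A minimal sketch of a map function with the combining folded in:

```python
from collections import Counter

def map_with_combiner(key, value):
    # combine locally: aggregate within this shard before emitting,
    # shrinking the data that must be shuffled across the network
    local = Counter(value.split())
    for term, count in local.items():
        yield term, count
```

The reduce function is unchanged (it still sums its input values); the combiner only reduces the volume of intermediate key/value pairs shipped to the reducers.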

14.2.3 Secondary Keys
MapReduce supports the concept of secondary keys that can be used to define the order in which values for the same key arrive at the reduce function.
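A sketch of what the shuffle phase does with secondary keys, under the assumption of a toy in-memory shuffle: intermediate pairs carry a (primary, secondary) key, grouping uses only the primary key, and the secondary key fixes the order of values within each group.

```python
from collections import defaultdict

def shuffle_with_secondary_keys(pairs):
    # group by primary key only
    grouped = defaultdict(list)
    for (primary, secondary), value in pairs:
        grouped[primary].append((secondary, value))
    # within each group, sort values by secondary key before they
    # are handed to the reduce function
    return {k: [v for _, v in sorted(vals)] for k, vals in grouped.items()}
```

This is useful when a reducer needs its values in a particular order (e.g. postings sorted by document id) without buffering and sorting them itself.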
