Friday, April 10, 2015

Reading Notes For Week 13

14.1 Parallel Query Processing
There are many ways in which parallelism can help a search engine process queries faster. The two most popular approaches are index partitioning and replication.

14.1.1 Document Partitioning
In a document-partitioned index, each index server is responsible for a subset of the documents in the collection. Each incoming user query is received by a frontend server, the receptionist, which forwards it to all n index nodes, waits for them to process the query, merges the search results received from the index nodes, and sends the final list of results to the user.
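As a rough illustration, here is a minimal Python sketch of the receptionist's fan-out-and-merge step, assuming each index node exposes a hypothetical top_k(query, k) call that returns (score, doc_id) pairs for its own document subset; a real system would issue these calls in parallel over RPC.

```python
import heapq

def receptionist(index_nodes, query, k=10):
    """Fan a query out to all document-partitioned index nodes and merge
    their result lists into a single global top-k list."""
    all_results = []
    for node in index_nodes:
        # Hypothetical per-node call; in practice this is a parallel RPC.
        all_results.extend(node.top_k(query, k))  # (score, doc_id) pairs
    # Keep the k highest-scoring documents across all partitions.
    return heapq.nlargest(k, all_results, key=lambda r: r[0])
```

Requesting the top k results from every partition is enough, because the global top k can only come from the union of the per-node top-k lists.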

14.1.2 Term Partitioning
Term partitioning addresses the disk seek problem by splitting the collection into sets of terms instead of sets of documents.
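As a sketch of the routing difference, the node responsible for a term could be chosen by hashing the term, so a query only contacts the nodes that hold postings for its terms. The hash-based assignment below is an assumption for illustration, not the only way to split the vocabulary.

```python
def node_for_term(term, n_nodes):
    """Hypothetical term-to-node assignment: hash each term to one node,
    which then stores the full postings list for that term."""
    return hash(term) % n_nodes

def nodes_for_query(query_terms, n_nodes):
    """Under term partitioning, a query only needs to contact the nodes
    hosting its terms, instead of all n nodes as in document partitioning."""
    return {node_for_term(t, n_nodes) for t in query_terms}
```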

Despite its potential performance advantage over the document-partitioned approach, at least for on-disk indices, term partitioning has several shortcomings that make it difficult to use the method in practice:
(1) Scalability
(2) Load Imbalance
(3) Term-at-a-Time


14.1.3 Hybrid Schemes
Consider a distributed index with n nodes, for a collection of size |C|. Document partitioning becomes inefficient if |C|/n is too small and disk seeks dominate the overall query processing cost. Term partitioning, on the other hand, becomes impractical if |C| is too large, as the time required to process a single query is likely to exceed the users’ patience.

As an alternative to the hybrid term/document partitioning, we may also consider a hybrid of document partitioning and replication.

14.1.4 Redundancy and Fault Tolerance
When operating a large-scale search engine with thousands of users, reliability and fault tolerance tend to be of similar importance as response time and result quality.

14.2 MapReduce
MapReduce is a framework developed at Google that is designed for massively parallel computations (thousands of machines) on very large amounts of data (many terabytes), and that can accomplish large-scale data processing tasks such as building an inverted index for a big document collection.

14.2.1 The Basic Framework
MapReduce was inspired by the map and reduce functions found in functional programming languages, such as Lisp.

Consider a MapReduce job that counts the number of occurrences of each term in a given corpus of text. The input values processed by the map function are documents or other pieces of text; the input keys are ignored. The outcome of the MapReduce is a sequence of (t, f_t) tuples, where t is a term and f_t is the number of times t appears in the input.
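A minimal single-process sketch of this word-count MapReduce, assuming whitespace tokenization: map emits (t, 1) for each term occurrence, and reduce sums the counts to produce f_t.

```python
from collections import defaultdict

def map_fn(key, value):
    """Map: the input key is ignored; the value is a piece of text.
    Emit (term, 1) for every term occurrence."""
    for term in value.lower().split():
        yield term, 1

def reduce_fn(key, values):
    """Reduce: sum all counts emitted for the same term t to obtain f_t."""
    yield key, sum(values)

def run_mapreduce(documents):
    """Tiny in-memory simulation of the shuffle between map and reduce."""
    groups = defaultdict(list)
    for i, doc in enumerate(documents):
        for term, count in map_fn(i, doc):
            groups[term].append(count)
    return dict(result for term, values in groups.items()
                for result in reduce_fn(term, values))

# Example: run_mapreduce(["to be or not to be"])
# -> {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```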

14.2.2 Combiners
In many MapReduce jobs, a single map shard may produce a large number of key/value pairs for the same key.
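A combiner can pre-aggregate those pairs on the map side before they cross the network. A hedged sketch, extending the word-count example above, where the combiner simply sums the partial counts produced by one map shard:

```python
from collections import defaultdict

def combiner(map_output):
    """Locally sum the (term, 1) pairs produced by one map shard, so that
    only one (term, partial_count) pair per distinct term is shuffled
    to the reducers instead of one pair per occurrence."""
    partial = defaultdict(int)
    for term, count in map_output:
        partial[term] += count
    return partial.items()
```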

14.2.3 Secondary Keys
MapReduce supports the concept of secondary keys that can be used to define the order in which values for the same key arrive at the reduce function.
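A small sketch of the idea, under the assumption that records are tagged with a (primary, secondary) key pair: grouping is done on the primary key only, while the secondary key determines the order in which the values reach reduce.

```python
from collections import defaultdict

def shuffle_with_secondary_key(pairs):
    """pairs: iterable of ((primary_key, secondary_key), value).
    Group by the primary key only, but deliver the values to the reducer
    sorted by their secondary key."""
    groups = defaultdict(list)
    for (primary, secondary), value in pairs:
        groups[primary].append((secondary, value))
    for primary, tagged_values in groups.items():
        tagged_values.sort(key=lambda sv: sv[0])  # order by secondary key
        yield primary, [value for _, value in tagged_values]
```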

Muddiest Points For Week 12

In text mining, we only have free text, which is unstructured data. How do we clean and preprocess it? Machine learning algorithms, both supervised and unsupervised, require features or attributes. How can we use text as features?

Thursday, April 2, 2015

Reading Note For Week 12

Classification Methods
1. Manual classification
(1) Very accurate when the job is done by experts
(2) Consistent when the problem size and the team are small
(3) Difficult and expensive to scale
2. Automatic document classification: Hand-coded rule-based systems
3. Supervised learning of a document-label assignment function
Many systems partly rely on machine learning models such as SVM and kNN.

Naive Bayes Assumption: assume that the probability of observing the conjunction of attributes is equal to the product of the individual conditional probabilities (the conditional independence assumption), i.e., P(x1, x2, ..., xn | cj) = P(x1 | cj) * P(x2 | cj) * ... * P(xn | cj).
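A minimal sketch of a classifier built on this assumption, using scikit-learn's MultinomialNB over bag-of-words counts; the toy documents and labels are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy labelled documents (made up for illustration).
docs = ["cheap pills buy now", "meeting agenda attached",
        "buy cheap meds", "project meeting notes"]
labels = ["spam", "ham", "spam", "ham"]

# Bag-of-words features: each term count is treated as an attribute x_i
# that is conditionally independent of the others given the class c_j.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

clf = MultinomialNB()
clf.fit(X, labels)
print(clf.predict(vectorizer.transform(["cheap meeting pills"])))
```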

Feature Selection
(1) Text collections have a large number of features
(2) May make using a particular classifier feasible
(3) Reduces training time
(4) Can improve generalization (performance)
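One common way to select features is to score each one against the class labels and keep only the top k, for example with a chi-square test. The sketch below uses scikit-learn's SelectKBest with made-up toy data and an arbitrary k.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["cheap pills buy now", "meeting agenda attached",
        "buy cheap meds", "project meeting notes"]
labels = ["spam", "ham", "spam", "ham"]

X = CountVectorizer().fit_transform(docs)
# Keep only the k features most dependent on the class label; fewer
# features means faster training and often better generalization.
X_reduced = SelectKBest(chi2, k=2).fit_transform(X, labels)
print(X_reduced.shape)  # (4 documents, 2 selected features)
```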

Evaluating Categorization
(1) Evaluation must be done on test data that are independent of the training data (usually a disjoint set of instances).
(2) Results can vary based on sampling error due to different training and test sets.
(3) Average results over multiple training and test sets (splits of the overall data) for the best results.
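A small sketch of point (3) using k-fold cross-validation in scikit-learn, which trains and tests on several disjoint splits and averages the scores; the toy corpus, the Naive Bayes classifier, and the 3-fold setting are arbitrary choices for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy labelled corpus (made up for illustration).
docs = ["buy cheap pills now", "meeting agenda attached", "cheap meds for sale",
        "project meeting notes", "win money buy now", "notes from the meeting"]
labels = ["spam", "ham", "spam", "ham", "spam", "ham"]

# Vectorization happens inside the pipeline, so each fold's test documents
# stay disjoint from the data the model is fitted on.
model = make_pipeline(CountVectorizer(), MultinomialNB())

# 3-fold cross-validation: train/test on three disjoint splits and average.
scores = cross_val_score(model, docs, labels, cv=3)
print(scores.mean(), scores.std())
```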

Vector space methods for Text Classification
(1) K Nearest Neighbors
(2) Decision boundaries
(3) Vector space classification using centroids
(4) Decision Trees (briefly)

k Nearest Neighbor Classification
To classify document d into class c:
(1) Define the k-neighborhood N as the k nearest neighbors of d
(2) Count the number i of documents in N that belong to c
(3) Estimate P(c|d) as i/k
(4) Choose as the class argmax_c P(c|d) [= the majority class]
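A minimal sketch of this procedure, assuming cosine similarity over term-frequency dictionaries as the distance measure; the representation and similarity function are choices for illustration, not the only options.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity of two sparse term-frequency dicts."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def knn_classify(d, training_docs, k=3):
    """training_docs: list of (term_frequency_dict, class_label) pairs.
    Find the k nearest neighbors of d, estimate P(c|d) = i/k, and
    return the majority class (argmax_c P(c|d))."""
    neighbors = sorted(training_docs,
                       key=lambda dc: cosine(d, dc[0]), reverse=True)[:k]
    counts = Counter(label for _, label in neighbors)
    return counts.most_common(1)[0][0]
```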

Muddiest Points For Week 11

Can we use trigram language models in retrieval models? To some degree, a trigram language model can provide more contextual information and reduce ambiguity for terms with multiple meanings.

Could Bayesian interpolation and online Bayesian updating be explained once more?