Friday, April 10, 2015

Reading Notes For Week 13

14.1 Parallel query processing
There are many ways in which parallelism can help a search engine process queries faster. The two most popular approaches are index partitioning and replication.

14.1.1 Document Partitioning
In a document-partitioned index, each index server is responsible for a subset of the documents in the collection. Each incoming user query is received by a frontend server, the receptionist, which forwards it to all n index nodes, waits for them to process the query, merges the search results received from the index nodes, and sends the final list of results to the user.
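A minimal sketch of this fan-out-and-merge pattern, with the shards simulated as in-memory dictionaries (the shard contents, scoring, and node count below are made up for illustration):

import heapq
from concurrent.futures import ThreadPoolExecutor

# Hypothetical document-partitioned index: each shard holds postings only
# for its own subset of documents, here as term -> [(doc_id, score), ...].
shard_indices = [
    {"parallel": [(1, 0.9), (4, 0.3)], "query": [(4, 0.6)]},
    {"parallel": [(7, 0.7)], "query": [(9, 0.5)]},
]

def search_shard(shard, query_terms, k):
    # Each index node evaluates the full query against its local documents.
    scores = {}
    for t in query_terms:
        for doc_id, s in shard.get(t, []):
            scores[doc_id] = scores.get(doc_id, 0.0) + s
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])

def receptionist(query_terms, k=10):
    # The frontend forwards the query to all n nodes in parallel,
    # waits for their partial result lists, and merges them.
    with ThreadPoolExecutor(max_workers=len(shard_indices)) as pool:
        partial = list(pool.map(lambda sh: search_shard(sh, query_terms, k), shard_indices))
    merged = [hit for hits in partial for hit in hits]
    return heapq.nlargest(k, merged, key=lambda kv: kv[1])

print(receptionist(["parallel", "query"]))  # global top-k across all shards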

14.1.2 Term Partitioning
Term partitioning addresses the disk seek problem by splitting the collection into sets of terms instead of sets of documents.
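A toy sketch of the routing that term partitioning implies: each term's complete postings list lives on exactly one node, so only the nodes that own the query's terms take part (the hash-based assignment and node count are assumptions for illustration):

# Hypothetical term-partitioned setup with a fixed number of index nodes.
NUM_NODES = 4

def node_for_term(term):
    # Each term is assigned to exactly one node, e.g. by hashing, and that
    # node stores the term's full postings list for the whole collection.
    return hash(term) % NUM_NODES

def nodes_for_query(query_terms):
    # Only the owners of the query terms are contacted, so each of them can
    # read one long postings list sequentially instead of doing many seeks.
    return {t: node_for_term(t) for t in query_terms}

print(nodes_for_query(["information", "retrieval"]))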

Despite its potential performance advantage over the document-partitioned approach, at least for on-disk indices, term partitioning has several shortcomings that make it difficult to use the method in practice:
(1) Scalability: each node holds the full postings lists for its terms, so its per-query workload grows with the size of the whole collection rather than with |C|/n.
(2) Load imbalance: query terms follow a highly skewed distribution, so nodes responsible for popular terms receive far more work than others.
(3) Term-at-a-time: term partitioning essentially forces term-at-a-time query processing, since the postings for different query terms reside on different nodes.


14.1.3 Hybrid Schemes
Consider a distributed index with n nodes, for a collection of size |C|. Document partitioning becomes inefficient if |C|/n is too small and disk seeks dominate the overall query processing cost. Term partitioning, on the other hand, becomes impractical if |C| is too large, as the time required to process a single query is likely to exceed the users’ patience.

As an alternative to the hybrid term/document partitioning, we may also consider a hybrid of document partitioning and replication.

14.1.4 Redundancy and Fault Tolerance
When operating a large-scale search engine with thousands of users, reliability and fault tolerance tend to be of similar importance as response time and result quality.

14.2 MapReduce
MapReduce is a framework developed at Google that is designed for massively parallel computations (thousands of machines) on very large amounts of data (many terabytes), and that can accomplish a wide variety of such large-scale data processing tasks (e.g., building the inverted index).

14.2.1 The Basic Framework
MapReduce was inspired by the map and reduce functions found in functional programming languages, such as Lisp.

Example: a MapReduce job that counts the number of occurrences of each term in a given corpus of text. The input values processed by the map function are documents or other pieces of text; the input keys are ignored. The outcome of the MapReduce is a sequence of (t, f_t) tuples, where t is a term and f_t is the number of times t appears in the input.
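A minimal single-process sketch of that counting job, with map and reduce as plain Python functions and a tiny driver that only imitates the framework's shuffle step (none of this is the real MapReduce API):

from collections import defaultdict

def map_fn(key, value):
    # The input key is ignored; the value is a document or piece of text.
    # Emit an intermediate (term, 1) pair for every term occurrence.
    for term in value.lower().split():
        yield term, 1

def reduce_fn(key, values):
    # Receive all counts emitted for one term and add them up.
    yield key, sum(values)

def run_mapreduce(inputs, map_fn, reduce_fn):
    groups = defaultdict(list)            # "shuffle": group values by key
    for key, value in inputs:
        for k, v in map_fn(key, value):
            groups[k].append(v)
    results = []
    for k, vs in sorted(groups.items()):  # reduce each group independently
        results.extend(reduce_fn(k, vs))
    return results

docs = [(0, "to be or not to be"), (1, "to see or not to see")]
print(run_mapreduce(docs, map_fn, reduce_fn))
# [('be', 2), ('not', 2), ('or', 2), ('see', 2), ('to', 4)]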

14.2.2 Combiners
In many MapReduce jobs, a single map shard may produce a large number of key/value pairs for the same key.
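A sketch of a combiner for the word-count job above: it pre-aggregates the (term, 1) pairs produced by one map shard so that far fewer pairs have to be shuffled over the network (the wiring shown is an illustration, not the framework's actual interface):

from collections import Counter

def combine_fn(pairs):
    # Locally sum the counts emitted by a single map shard. Because counting
    # is associative and commutative, the reducer still produces the correct
    # totals whether or not the combiner runs.
    partial = Counter()
    for term, count in pairs:
        partial[term] += count
    return list(partial.items())

shard_output = [("to", 1), ("be", 1), ("or", 1), ("not", 1), ("to", 1), ("be", 1)]
print(combine_fn(shard_output))  # six pairs shrink to four: to:2, be:2, or:1, not:1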

14.2.3 Secondary Keys
MapReduce supports the concept of secondary keys that can be used to define the order in which values for the same key arrive at the reduce function.
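A tiny sketch of the idea: group by the primary key, but sort each group by a secondary key so the reducer sees its values in a well-defined order (the tuple layout is just an illustrative convention):

from itertools import groupby

# Hypothetical intermediate pairs: ((primary_key, secondary_key), value).
pairs = [(("doc", 3), "third"), (("doc", 1), "first"), (("doc", 2), "second")]

# Sort by (primary, secondary) but group only by the primary key.
pairs.sort(key=lambda kv: kv[0])
for primary, group in groupby(pairs, key=lambda kv: kv[0][0]):
    ordered_values = [v for (_, _), v in group]
    print(primary, ordered_values)   # doc ['first', 'second', 'third']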

Muddiest Points For Week 12

In text mining, we only have free-text data, which is unstructured. How do we clean and preprocess this data? Machine learning algorithms, both supervised and unsupervised, need features or attributes to work with. How can we use this text as features?

Thursday, April 2, 2015

Reading Note For Week 12

Classification Methods
1. Manual classification
(1) Very accurate when the job is done by experts
(2) Consistent when the problem size and team is small
(3) Difficult and expensive to scale
2. Automatic document classification: Hand-coded rule-based systems
3. Supervised learning of a document-label assignment function
Many systems partly rely on machine learning models, such as SVM and kNN.

Naive Bayes assumption: assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities, i.e., P(x_1, ..., x_n | c_j) = P(x_1 | c_j) × ... × P(x_n | c_j) (the conditional independence assumption).
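A minimal sketch of how the assumption is used at classification time, with toy counts and add-one smoothing (the data and smoothing choice are assumptions for illustration):

import math
from collections import Counter, defaultdict

# Toy training data: (document tokens, class label).
train = [
    (["cheap", "pills", "buy"], "spam"),
    (["meeting", "schedule", "notes"], "ham"),
    (["buy", "cheap", "now"], "spam"),
]

class_counts = Counter(label for _, label in train)
term_counts = defaultdict(Counter)
for tokens, label in train:
    term_counts[label].update(tokens)
vocab = {t for tokens, _ in train for t in tokens}

def log_posterior(tokens, label):
    # log P(c_j) + sum_i log P(x_i | c_j): the sum is exactly where the
    # conditional independence assumption is applied.
    logp = math.log(class_counts[label] / len(train))
    total = sum(term_counts[label].values())
    for t in tokens:
        logp += math.log((term_counts[label][t] + 1) / (total + len(vocab)))
    return logp

doc = ["buy", "pills"]
print(max(class_counts, key=lambda c: log_posterior(doc, c)))  # -> 'spam'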

Feature Selection
(1) Text collections have a large number of features
(2) May make using a particular classifier feasible
(3) Reduces training time
(4) Can improve generalization (performance)
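A small sketch of feature selection in practice, using chi-squared selection from scikit-learn (assuming scikit-learn is available; the corpus, labels, and k are made up):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["cheap pills buy now", "meeting schedule notes",
        "buy cheap now", "project meeting notes"]
labels = ["spam", "ham", "spam", "ham"]

X = CountVectorizer().fit_transform(docs)         # one feature per distinct term
selector = SelectKBest(chi2, k=3).fit(X, labels)  # keep the 3 most informative terms
X_reduced = selector.transform(X)                 # smaller matrix -> faster training
print(X.shape, "->", X_reduced.shape)             # (4, 8) -> (4, 3)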

Evaluating Categorization
(1) Evaluation must be done on test data that are independent of the training data (usually a disjoint set of instances).
(2) Results can vary based on sampling error due to different training and test sets.
(3) Average results over multiple training and test sets (splits of the overall data) for the best results.
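A sketch of point (3), averaging over several train/test splits via cross-validation (scikit-learn assumed available; the documents and fold count are made up):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["cheap pills buy", "buy cheap now", "win money fast",
        "meeting schedule notes", "project meeting agenda", "lecture notes posted"]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
scores = cross_val_score(model, docs, labels, cv=3)  # three disjoint test folds
print(scores, scores.mean())                         # report the averaged accuracy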

Vector space methods for Text Classification
(1) K Nearest Neighbors
(2) Decision boundaries
(3) Vector space classification using centroids
(4) Decision Trees (briefly)

k Nearest Neighbor Classification
To classify document d into class c (a sketch follows this list):
(1) Define the k-neighborhood N as the k nearest neighbors of d.
(2) Count the number i of documents in N that belong to c.
(3) Estimate P(c|d) as i/k.
(4) Choose as the class argmax_c P(c|d) (i.e., the majority class).
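A compact sketch of these steps over toy term-weight vectors (the documents, weights, and the cosine similarity choice are assumptions for illustration):

import math
from collections import Counter

def cosine(u, v):
    # Cosine similarity between two sparse term-weight vectors (dicts).
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def knn_classify(d, training_set, k=3):
    # training_set is a list of (vector, class) pairs; the majority class
    # among the k nearest neighbours is argmax_c P(c|d), with P(c|d) = i/k.
    neighbours = sorted(training_set, key=lambda vc: cosine(d, vc[0]), reverse=True)[:k]
    votes = Counter(c for _, c in neighbours)
    return votes.most_common(1)[0][0]

train = [({"ball": 1, "goal": 2}, "sports"), ({"match": 1, "goal": 1}, "sports"),
         ({"vote": 2, "party": 1}, "politics"), ({"election": 1, "vote": 1}, "politics")]
print(knn_classify({"goal": 1, "vote": 1}, train, k=3))  # majority vote of the 3 neighbours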

Muddiest Points For Week 11

Can we use trigram language models in retrieval models? To some degree, a trigram language model can provide more contextual information and reduce ambiguity for terms with multiple meanings.

Can you explain Bayesian interpolation and online Bayesian updating once more?

Tuesday, March 24, 2015

Reading Note For Week 11

Content-based recommendation systems: systems that recommend an item to a user based upon a description of the item and a profile of the user’s interests. Content-based recommendation systems analyze item descriptions to identify items that are of particular interest to the user.

Item Representation
(1) Items that can be recommended to the user are often stored in a database table
(2) Unstructured data: free text data
(3) Many domains are best represented by semi-structured data in which there are some attributes with a set of restricted values and some free-text fields.
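A tiny illustration of such a semi-structured item record (the domain and fields are hypothetical):

# A hypothetical restaurant item: restricted-value attributes plus free text.
item = {
    "cuisine": "italian",        # drawn from a restricted set of values
    "price_range": "$$",         # drawn from a restricted set of values
    "description": "Cozy trattoria with homemade pasta and a long wine list.",
}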

Two types of information on user profile
(1) A model of the user’s preferences
(2) A history of the user’s interactions with the recommendation system

Creating a model of the user's preferences from the user history is a form of classification learning. The training data of a classification learner is divided into categories, e.g., the binary categories “items the user likes” and “items the user doesn't like.” Common learning approaches include (see the sketch after this list):
(1) Decision Trees and Rule Induction
(2) Nearest Neighbor Methods
(3) Relevance Feedback and Rocchio’s Algorithm
(4) Linear Classifiers
(5) Probabilistic Methods and Naïve Bayes
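A sketch of this classification-learning view using one of the listed families, a linear classifier over TF-IDF descriptions of items the user liked or disliked (scikit-learn assumed available; the history and labels are made up):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical user history: item descriptions with like (1) / dislike (0) labels.
history = ["space opera with epic battles", "galactic empire science fiction",
           "alien invasion thriller", "romantic comedy in paris",
           "feel-good romance drama", "wedding comedy of errors"]
liked = [1, 1, 1, 0, 0, 0]

profile = make_pipeline(TfidfVectorizer(), LogisticRegression())
profile.fit(history, liked)                      # the learned model is the user profile

new_items = ["deep space exploration epic", "small-town romance"]
print(profile.predict(new_items))                # predicted like/dislike for unseen items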


In the modern Web, as the amount of information available causes information overload, the demand for personalized approaches for information access increases. Personalized systems address the overload problem by building, managing, and representing information customized for individual users.

Collecting information about users
(1) The information collected may be explicitly input by the user or implicitly gathered by a software agent.
(2) Depending on how the information is collected, different data about the users may be extracted.

User Profile Construction
(1) Building Keyword Profiles: keyword-based profiles are initially created by extracting keywords from Web pages collected from some information source, e.g., the user's browsing history or bookmarks (see the sketch after this list).
(2) Building Semantic Network Profiles: semantic network-based profiles are typically built by collecting explicit positive and/or negative feedback from users.
(3) Building Concept Profiles: concept-based profiles represent the user's interests with concepts drawn from an existing taxonomy or ontology rather than with free-form keywords.
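A sketch of step (1), building a keyword profile by extracting the highest-weighted TF-IDF terms from pages in the user's history (scikit-learn assumed available; the pages and profile size are made up):

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical pages from the user's browsing history or bookmarks.
pages = ["python tutorial on text classification and machine learning",
         "information retrieval lecture notes on index compression",
         "machine learning course page on classification and clustering"]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(pages)

# Keyword profile: the terms with the highest aggregate TF-IDF weight.
weights = tfidf.sum(axis=0).A1
terms = vectorizer.get_feature_names_out()
profile = sorted(zip(terms, weights), key=lambda tw: tw[1], reverse=True)[:5]
print(profile)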

Muddiest Points For Week 10

Can a multithreaded crawler avoid crawling duplicate information? How can it prevent this from happening?

Can you explain the computation of HITS once more, especially the iterative loop in the process?

Friday, March 20, 2015

Reading Notes For Week 10

User Need:
(1) Informational: want to learn about something
(2) Navigational: want to go to that page
(3) Transactional: want to do something
(4) Gray Areas

Search engine optimization (spam):
(1) Motives: commercial, political, religious, lobbies; promotion funded by advertising budget
(2) Operators: contractors (search engine optimizers) for lobbies and companies; webmasters; hosting services
(3) Forums

Random searches:
Choose random searches extracted from a local log or build “random searches”
Advantage: Might be a better reflection of the human perception of coverage
Disadvantages:
(1) Samples are correlated with the source of the log
(2) Duplicates
(3) Technical statistical problems (must have non-zero results; ratio average not statistically sound)

Random IP addresses:
Advantages:
(1) Clean statistics
(2) Independent of crawling strategies
Disadvantages:
(1) Doesn't deal with duplication
(2) Many hosts might share one IP, or not accept requests
(3) No guarantee all pages are linked to the root page
(4) Power law for # pages/host generates bias towards sites with few pages
(5) Potentially influenced by spamming (multiple IPs for the same server to avoid IP blocking)

Random walks:
View the Web as a directed graph and build a random walk on this graph.
Advantages:
(1) “Statistically clean” method, at least in theory!
(2) Could work even for an infinite web (assuming convergence) under certain metrics
Disadvantages:
(1) List of seeds is a problem
(2) Practical approximation might not be valid
(3) Non-uniform distribution → subject to link spamming
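A toy sketch of the random-walk idea on a made-up link graph, restarting from a seed list with some probability (the graph, seed list, and restart rate are all assumptions; real estimators need corrections for the non-uniform visit distribution noted above):

import random

# Toy directed web graph: page -> outgoing links.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a", "d"], "d": ["a"]}
seeds = ["a"]

def random_walk(steps=10000, restart=0.15):
    # Walk the link graph; jump back to a seed with some probability
    # (and whenever the current page has no outgoing links).
    counts = {page: 0 for page in graph}
    current = random.choice(seeds)
    for _ in range(steps):
        counts[current] += 1
        links = graph.get(current, [])
        if not links or random.random() < restart:
            current = random.choice(seeds)
        else:
            current = random.choice(links)
    return {p: round(c / steps, 3) for p, c in counts.items()}

print(random_walk())  # visit frequencies: pages near the seeds are over-represented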