Tuesday, March 24, 2015

Reading Notes For Week 11

Content-based recommendation systems: systems that recommend an item to a user based upon a description of the item and a profile of the user’s interests. Content-based recommendation systems analyze item descriptions to identify items that are of particular interest to the user.
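
A minimal sketch of this matching idea (the item descriptions, the user-profile text, and the use of scikit-learn are my own illustrative assumptions, not from the reading): represent item descriptions and the user profile as TF-IDF vectors and rank items by cosine similarity to the profile.

```python
# Content-based recommendation sketch: rank items by how similar their
# descriptions are to a text profile of the user's interests.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

items = {                                   # hypothetical item descriptions
    "noodle_bar": "noodle bar with hand-pulled noodles and spicy hot broth",
    "quiet_cafe": "quiet cafe with espresso, pastries and free wifi",
    "steakhouse": "upscale steakhouse with an extensive wine list",
}
user_profile = "likes spicy food, noodles, casual cheap dining"  # assumed profile text

vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(list(items.values()) + [user_profile])

item_vectors = matrix[: len(items)]
profile_vector = matrix[len(items) :]
scores = cosine_similarity(item_vectors, profile_vector).ravel()

for name, score in sorted(zip(items, scores), key=lambda pair: -pair[1]):
    print(f"{name}: {score:.3f}")           # highest score = best content match
```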

Item Representation
(1) Structured data: items that can be recommended to the user are often stored in a database table, each described by the same set of attributes
(2) Unstructured data: free-text descriptions
(3) Semi-structured data: some attributes with a set of restricted values and some free-text fields; many domains are best represented this way

Two types of information in a user profile
(1) A model of the user’s preferences
(2) A history of the user’s interactions with the recommendation system

Creating a model of the user’s preferences from the user history is a form of classification learning. The training data of a classification learner is divided into categories, e.g., the binary categories “items the user likes” and “items the user doesn’t like.” Common learning methods include (a minimal Naïve Bayes sketch appears after this list):
(1) Decision Trees and Rule Induction
(2) Nearest Neighbor Methods
(3) Relevance Feedback and Rocchio’s Algorithm
(4) Linear Classifiers
(5) Probabilistic Methods and Naïve Bayes
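
As a concrete example of method (5), here is a minimal sketch of learning a like/dislike classifier from a user history (the training descriptions and labels are invented for illustration; any of the other listed methods could be substituted):

```python
# Learn "like" vs. "dislike" from item descriptions in the user history,
# then classify a new item (multinomial Naive Bayes over word counts).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

history = [                                  # hypothetical labeled history
    ("spicy noodle bar with hot broth", "like"),
    ("quiet espresso cafe with pastries", "dislike"),
    ("szechuan restaurant with very spicy dishes", "like"),
    ("tea house with mild herbal teas", "dislike"),
]
texts, labels = zip(*history)

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

new_item = "ramen shop with spicy miso broth"
print(model.predict([new_item])[0])          # -> "like" (on this toy data)
```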


In the modern Web, as the amount of available information causes information overload, the demand for personalized approaches to information access increases. Personalized systems address the overload problem by building, managing, and representing information customized for individual users.

Collecting information about users
(1) The information collected may be explicitly input by the user or implicitly gathered by a software agent.
(2) Depending on how the information is collected, different data about the users may be extracted.

User Profile Construction
(1) Building Keyword Profiles: keyword-based profiles are initially created by extracting keywords from Web pages collected from some information source, e.g., the user’s browsing history or bookmarks (a small sketch appears after this list).
(2) Building Semantic Network Profiles: semantic network-based profiles are typically built by collecting explicit positive and/or negative feedback from users.
(3) Building Concept Profiles: concept-based profiles represent the user’s interests as weighted concepts drawn from a reference taxonomy or ontology, with weights derived from the documents the user has viewed.
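
A small sketch of the keyword-profile construction in item (1), assuming we already have the text of a few pages from the user’s browsing history (the page texts and the top-k cutoff are illustrative assumptions):

```python
# Build a keyword profile: keep the highest-weighted TF-IDF terms
# aggregated over pages the user has visited.
from sklearn.feature_extraction.text import TfidfVectorizer

browsed_pages = [                    # hypothetical browsing-history texts
    "python tutorial on list comprehensions and generators",
    "introduction to machine learning with python and scikit-learn",
    "recipes for quick weeknight pasta dinners",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(browsed_pages)

weights = tfidf.sum(axis=0).A1       # total weight per term (.A1 flattens)
terms = vectorizer.get_feature_names_out()
top_k = 5
profile = sorted(zip(terms, weights), key=lambda pair: -pair[1])[:top_k]
print(profile)                       # e.g. [('python', ...), ('learning', ...), ...]
```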

Muddiest Points For Week 10

Can multithreaded crawlers avoid crawling duplicate information? How do they prevent this from happening?

Can you explain the computation of HITS once more, especially the iterative loop in the process?

Friday, March 20, 2015

Reading Notes For Week 10

User Need:
(1) Informational: want to learn about something
(2) Navigational: want to go to that page
(3) Transactional: want to do something
(4) Gray Areas

Search engine optimization (spam):
(1) Motives: Commercial, political, religious, lobbies; Promotion funded by advertising budget
(2) Operators: Contractors (search engine optimizers) for lobbies, companies; webmasters; hosting services
(3) Forums

Random searches:
Choose random searches extracted from a local log or build “random searches”
Advantage: Might be a better reflection of the human perception of coverage
Disadvantages:
(1) Samples are correlated with the source of the log
(2) Duplicates
(3) Technical statistical problems (must have non-zero results; the average of ratios is not statistically sound)
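
The notes above don’t spell out how random searches turn into a size comparison; one common recipe (a Bharat–Broder-style capture–recapture estimate, which is my assumption here rather than something stated in the slides) issues the sampled queries to two engines and checks how often pages found by one are also indexed by the other:

```python
# Capture-recapture estimate of relative index size from overlap fractions:
#   |A ∩ B| ≈ f_ab * |A| ≈ f_ba * |B|   =>   |A| / |B| ≈ f_ba / f_ab
# where f_ab is the fraction of pages sampled from A that B also indexes.
def relative_size(f_ab: float, f_ba: float) -> float:
    """Estimate |A| / |B| from the two overlap fractions."""
    return f_ba / f_ab

# Hypothetical measurements from issuing random log queries to both engines:
f_ab = 0.35   # 35% of A's sampled result pages were also found in B
f_ba = 0.70   # 70% of B's sampled result pages were also found in A
print(relative_size(f_ab, f_ba))   # -> 2.0, i.e. A's index is about twice B's
```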

Random IP addresses
Advantages:
(1) Clean statistics
(2) Independent of crawling strategies
Disadvantages:
(1) Doesn’t deal with duplication
(2) Many hosts might share one IP, or not accept requests
(3) No guarantee all pages are linked to the root page
(4) Power law for # pages/host generates a bias towards sites with few pages
(5) Potentially influenced by spamming (multiple IPs for the same server to avoid IP blocks)

Random walks:
View the Web as a directed graph and build a random walk on this graph (a small sketch follows the list below).
Advantages:
(1) “Statistically clean” method, at least in theory!
(2) Could work even for an infinite web (assuming convergence) under certain metrics
Disadvantages:
(1) The list of seeds is a problem
(2) Practical approximation might not be valid
(3) Non-uniform distribution, hence subject to link spamming
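
A minimal sketch of the random-walk idea on a toy directed graph (the graph, the teleport probability, and the walk length are illustrative assumptions; a real walk would fetch live pages and sample from the visited set):

```python
# Random walk on a small directed "web" graph with occasional teleports to a
# seed list, so the walk cannot get stuck in dead ends or small loops.
import random
from collections import Counter

graph = {                  # hypothetical link structure: page -> outlinks
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A", "D"],
    "D": [],               # dead end
}
seeds = list(graph)        # the seed-list problem mentioned above
teleport = 0.15            # probability of jumping to a random seed

def random_walk(steps: int = 100_000) -> Counter:
    visits = Counter()
    page = random.choice(seeds)
    for _ in range(steps):
        visits[page] += 1
        if random.random() < teleport or not graph[page]:
            page = random.choice(seeds)          # teleport / escape dead end
        else:
            page = random.choice(graph[page])    # follow a random outlink
    return visits

counts = random_walk()
total = sum(counts.values())
print({p: round(c / total, 3) for p, c in counts.items()})  # visit frequencies
```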

Muddiest Points For Week 9

How does scatter/gather work?

Thursday, March 12, 2015

Reading Notes For Week 9

The role of the user interface is to help searchers:
(1) understand and express their information need
(2) formulate their queries
(3) select among available information sources
(4) understand search results
(5) keep track of the progress of their search

User interaction with search interfaces differs depending on
(1) the type of task
(2) the domain expertise of the information seeker
(3) the amount of time and effort available to invest in the process

User interaction with search interfaces can be divided into two distinct parts
(1) information lookup
(2) exploratory search, including learning and investigating tasks

Classic Model vs. Dynamic Model
Classic notion of the information seeking process:
(1) problem identification
(2) articulation of information need(s)
(3) query formulation
(4) results evaluation

More recent models emphasize the dynamic nature of the search process:
(1) The users learn as they search
(2) Their information needs adjust as they see retrieval results and other document surrogates

The primary methods for searchers to express their information need:
(1) entering words into a search entry form
(2) selecting links from a directory or other information organization display

The document surrogate refers to the information that summarizes the document.
(1) This information is a key part of the success of the search interface
(2) The design of document surrogates is an active area of research and experimentation
(3) The quality of the surrogate can greatly affect the perceived relevance of the search results listing (a small snippet-generation sketch follows this list)
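
A small sketch of one common surrogate, a query-biased snippet that picks the sentence with the most query-term hits and highlights them (the document text, query, and scoring rule are illustrative assumptions):

```python
# Query-biased snippet: choose the sentence containing the most distinct
# query terms and mark the hits, as a tiny stand-in for a document surrogate.
import re

def snippet(document: str, query: str) -> str:
    terms = set(query.lower().split())
    sentences = re.split(r"(?<=[.!?])\s+", document)
    best = max(sentences,
               key=lambda s: len(terms & set(re.findall(r"\w+", s.lower()))))
    for term in terms:                      # HTML-style highlighting of hits
        best = re.sub(rf"\b({re.escape(term)})\b", r"<b>\1</b>", best,
                      flags=re.IGNORECASE)
    return best

doc = ("TileBars was proposed by Marti Hearst. It visualizes query term hits. "
       "Each document is shown as a horizontal bar divided into segments.")
print(snippet(doc, "query term hits"))      # -> sentence with the terms bolded
```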

Tools to help users reformulate their query
(1) One technique consists of showing terms related to the query or to the documents retrieved in response to the query (a small sketch follows below)
(2) A special case of this is spelling corrections or suggestions
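
A small sketch of the first technique, suggesting terms that co-occur with the query terms in the top retrieved documents (the documents, query, stopword list, and scoring are illustrative assumptions):

```python
# Suggest reformulation terms: count words that co-occur with the query terms
# in top-ranked documents and propose the most frequent ones.
import re
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "in", "for", "to", "is", "new"}

def suggest_terms(query, top_docs, k=5):
    query_terms = set(query.lower().split())
    counts = Counter()
    for doc in top_docs:
        tokens = re.findall(r"[a-z]+", doc.lower())
        if query_terms & set(tokens):                 # doc mentions the query
            counts.update(t for t in tokens
                          if t not in query_terms and t not in STOPWORDS)
    return [term for term, _ in counts.most_common(k)]

docs = [  # hypothetical top-ranked documents for the query "jaguar"
    "jaguar is a british maker of luxury cars and sports cars",
    "the jaguar is a large cat native to the americas",
    "jaguar cars announced a new electric model",
]
print(suggest_terms("jaguar", docs))   # e.g. ['cars', 'luxury', 'sports', ...]
```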

Organizing Search Results:
(1) Category system: meaningful labels organized in such a way as to reflect the concepts relevant to a domain
(2) Clustering refers to the grouping of items according to some measure of similarity
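
A small sketch of the clustering approach in item (2), grouping result titles with k-means over TF-IDF vectors (the result titles and the number of clusters are illustrative assumptions):

```python
# Cluster search results by the textual similarity of their titles/snippets.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

results = [                          # hypothetical result titles for "jaguar"
    "jaguar big cat habitat and diet",
    "jaguar animal facts for kids",
    "jaguar f-type sports car review",
    "new jaguar electric car prices",
]

vectors = TfidfVectorizer(stop_words="english").fit_transform(results)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)

for label, title in zip(kmeans.labels_, results):
    print(label, title)              # two groups: the animal vs. the car
```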

Visualizing Query Terms
(1) Understanding the role of the query terms within the retrieved docs can help relevance assessment
(2) In the TileBars interface, for instance, documents are shown as horizontal glyphs with the locations of the query term hits marked along each glyph
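
A rough sketch of the TileBars idea (the document, query terms, and segment size are illustrative assumptions): each document becomes a row of blocks per query term, with a filled block wherever a segment of the document contains that term.

```python
# TileBars-style glyph: split a document into fixed-size segments and print
# one row per query term, filling the segments in which the term occurs.
def tilebar(doc: str, terms: list, seg_len: int = 5) -> None:
    words = doc.lower().split()
    segments = [words[i:i + seg_len] for i in range(0, len(words), seg_len)]
    for term in terms:
        row = "".join("#" if term in seg else "." for seg in segments)
        print(f"{term:>12} {row}")

doc = ("the jaguar is a large cat found in the americas "
       "unlike the jaguar car the animal prefers dense rainforest habitat "
       "conservation of the jaguar focuses on protecting rainforest corridors")
tilebar(doc, ["jaguar", "rainforest"])
#       jaguar #.#.#.     (segments of the document containing "jaguar")
#   rainforest ...#.#     (segments containing "rainforest")
```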

Words and Docs Relationships
(1) Numerous works have proposed variations on the idea of placing words and docs on a two-dimensional canvas
(2) Another idea is to map docs or words from a very high-dimensional term space down into a 2D plane
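
A small sketch of the second idea, projecting TF-IDF document vectors down to two dimensions with truncated SVD so they could be drawn on a canvas (the documents and the choice of SVD are illustrative assumptions; the literature uses many different projections):

```python
# Map documents from the high-dimensional term space to 2D coordinates.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [                              # hypothetical documents
    "neural networks for image recognition",
    "deep learning models for computer vision",
    "recipes for sourdough bread baking",
    "how to bake a simple white loaf",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
coords = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)

for doc, (x, y) in zip(docs, coords):
    print(f"({x:5.2f}, {y:5.2f})  {doc}")   # similar docs land near each other
```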

Muddiest Points For Week 7

How do we use interpolation to describe the relationship between precision and recall on a graph?

Can we talk about relevance feedback in the language modeling approach?