Friday, March 20, 2015

Reading Notes For Week 10

User Needs:
(1) Informational: want to learn about something
(2) Navigational: want to go to that page
(3) Transactional: want to do something
(4) Gray Areas

Search engine optimization (Spam):
(1) Motives: Commercial, political, religious, lobbies; Promotion funded by advertising budget
(2) Operators: Contractors (Search Engine Optimizers) for lobbies, companies; Webmasters; Hosting services
(3) Forums

Random searches:
Choose random searches extracted from a local log or build “random searches”
Advantage: Might be a better reflection of the human perception of coverage
Disadvantages:
(1) Samples are correlated with the source of the log
(2) Duplicates
(3) Technical statistical problems (queries must have non-zero results; an average of ratios is not statistically sound)
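
A minimal sketch of the random-searches idea, assuming the log is just a list of query strings. The names sample_queries, coverage_ratios, count_a and count_b are hypothetical; count_a and count_b stand in for whatever returns a result count from each engine.

```python
import random

def sample_queries(log_lines, k, seed=0):
    """Pick up to k distinct queries uniformly at random from a query log."""
    # Normalize and de-duplicate so a repeated query is not sampled twice.
    unique = sorted({line.strip().lower() for line in log_lines if line.strip()})
    rng = random.Random(seed)
    return rng.sample(unique, min(k, len(unique)))

def coverage_ratios(queries, count_a, count_b):
    """Per-query ratio of result counts, skipping zero-result queries."""
    ratios = {}
    for q in queries:
        a, b = count_a(q), count_b(q)
        # The ratio is undefined or meaningless if either engine returns nothing.
        if a > 0 and b > 0:
            ratios[q] = a / b
    return ratios
```

The sample still inherits the bias of the log it came from, and averaging the per-query ratios is the step the notes flag as not statistically sound.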

Random IP addresses
Advantages:
(1) Clean statistics
(2) Independent of crawling strategies
Disadvantages:
(1) Doesn’t deal with duplication
(2) Many hosts might share one IP, or not accept requests
(3) No guarantee all pages are linked to root page.
(4) Power law for # pages/hosts generates bias towards sites with few pages.
(5) Potentially influenced by spamming (multiple IPs for same server to avoid IP block)
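
A minimal sketch of the sampling step only, assuming IPv4 and Python's standard ipaddress module. The function name random_public_ipv4 is hypothetical, and contacting the host and crawling from its root page are deliberately left out.

```python
import ipaddress
import random

def random_public_ipv4(rng):
    """Draw uniform 32-bit values until one is a globally routable unicast address."""
    while True:
        addr = ipaddress.IPv4Address(rng.getrandbits(32))
        # Skip private, reserved, loopback and multicast ranges.
        if addr.is_global and not addr.is_multicast:
            return addr

rng = random.Random(42)  # fixed seed so the sample is repeatable
sample = [random_public_ipv4(rng) for _ in range(5)]
```

Each address is equally likely, but each page is not: one address may serve many hosts, and a host with few pages gives each of its pages a larger share of the sample.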

Random walks:
View the Web as a directed graph. Build a random walk on this graph.
Advantages:
(1) “Statistically clean” method, at least in theory!
(2) Could work even for infinite web (assuming convergence) under certain metrics.
Disadvantages:
(1) List of seeds is a problem.
(2) Practical approximation might not be valid.
(3) Non-uniform distribution, so subject to link spamming
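
A minimal sketch of a random walk over a toy directed graph, assuming the graph fits in a dict that maps each page to its out-links. The function name random_walk and the rule of jumping back to a random seed at dead ends are illustrative assumptions, not a rule taken from the reading.

```python
import random
from collections import Counter

def random_walk(graph, seeds, steps, rng):
    """Walk for `steps` steps and count how often each node is visited."""
    visits = Counter()
    node = rng.choice(seeds)
    for _ in range(steps):
        visits[node] += 1
        out_links = graph.get(node, [])
        # Follow a random out-link; at a dead end, restart from a seed.
        node = rng.choice(out_links) if out_links else rng.choice(seeds)
    return visits

# Toy graph: "d" links out but nothing links to it.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
counts = random_walk(toy_graph, seeds=["a"], steps=1000, rng=random.Random(0))
```

In this toy graph "d" is never visited because it is neither a seed nor linked to, which shows the seed problem. Pages with more in-links, such as "c", tend to be visited more often, which is why the sample is non-uniform and why link spamming can skew it.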
