The first step of processing is to convert a byte sequence (the digital documents stored in a file or on a web server) into a linear sequence of characters. To do that, we need to determine the correct encoding of the document, such as UTF-8 or ASCII, as well as the document format.
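As a rough illustration of the decoding step, the sketch below tries a short list of candidate encodings in order and keeps the first one that decodes cleanly. The function name `decode_bytes` and the candidate list are assumptions made for this example; real systems usually rely on metadata or a trained detector instead of trial and error.

```python
def decode_bytes(raw, encodings=("utf-8", "latin-1")):
    """Return (text, encoding) for the first candidate encoding that works.

    Hypothetical helper: the candidate order matters, because latin-1
    accepts every byte sequence and therefore acts as a last resort.
    """
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue  # this encoding does not fit, try the next one
    raise ValueError("none of the candidate encodings could decode the input")


text, used = decode_bytes("café".encode("utf-8"))
print(text, used)
```

Putting the strictest encoding first is the design choice that makes this work: UTF-8 rejects most byte sequences that were written in another encoding, so a successful UTF-8 decode is reasonably strong evidence.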
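Next, we need to determine the document unit for indexing. For very long documents, the issue of indexing granularity arises. It is a tradeoff between precision and recall: if the units are too small, we may miss relevant passages because the query terms are spread over several units; if they are too large, we get spurious matches.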
Given a character sequence, we need to chop it into pieces called tokens. This process is called tokenization.
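A minimal sketch of this chopping step, assuming that a token is simply a maximal run of letters, digits, or underscores. The name `tokenize` and that definition of a token are choices made for this example; real tokenizers have to deal with apostrophes, hyphens, and language-specific rules.

```python
import re


def tokenize(text):
    """Split text into tokens, treating every run of word characters as a token."""
    return re.findall(r"\w+", text)


print(tokenize("Friends, Romans, countrymen."))
```

Note that this simple rule splits a name like O'Neill into two tokens, which is exactly the kind of decision a production tokenizer must make deliberately.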
We may drop very common terms, the stop words, such as “a” and “an”. But the meaning of a phrase may change if common terms are dropped.
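The following sketch shows both the filtering and the risk it carries. The stop list here is a tiny made-up set for illustration, not a standard list, and `drop_stop_words` is a hypothetical helper name.

```python
# Illustrative stop list only; real lists are built from collection statistics.
STOP_WORDS = {"a", "an", "the", "to", "be", "or", "not", "of"}


def drop_stop_words(tokens):
    """Remove tokens whose lowercase form is in the stop list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


print(drop_stop_words(["flights", "to", "London"]))
print(drop_stop_words("to be or not to be".split()))
```

The first call loses the direction implied by “to”, and the second call returns an empty list, so the phrase can no longer be searched at all.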
Token normalization is the process of canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens. The most standard way to normalize is to create equivalence classes.
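One way to picture an equivalence class is as the set of all tokens that map to the same normalized string. The sketch below assumes three simple rules (strip accents, fold case, delete periods and hyphens); the function name `normalize` and the choice of rules are assumptions for this example, not a fixed standard.

```python
import unicodedata


def normalize(token):
    """Map a token to a canonical form that names its equivalence class."""
    # Decompose accented characters, then drop the combining accent marks.
    decomposed = unicodedata.normalize("NFKD", token)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Fold case and remove periods and hyphens.
    return no_accents.casefold().replace(".", "").replace("-", "")


for tok in ["U.S.A.", "USA", "naïve", "Anti-discriminatory"]:
    print(tok, "->", normalize(tok))
```

Tokens with the same output, such as U.S.A. and USA, fall into one class and will match each other at query time. The same function must be applied to both documents and queries.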
The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.
For phrase queries, one approach is to treat every pair of consecutive terms in a document as a phrase and index it as a vocabulary term; this is a biword index. However, the standard solution is a positional index, which stores, for each term, the positions at which it occurs in each document. The two strategies can be fruitfully combined.
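To make the positional index concrete, here is a small in-memory sketch. It assumes whitespace tokenization and lowercase terms, and the names `build_positional_index` and `phrase_query` are invented for this example. A document matches a phrase when the terms appear at consecutive positions.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Build term -> doc_id -> list of positions from a {doc_id: text} mapping."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index


def phrase_query(index, phrase):
    """Return the set of doc ids in which the phrase terms occur consecutively."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return set()
    # Only documents that contain every term can match.
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    hits = set()
    for doc_id in candidates:
        for start in index[terms[0]][doc_id]:
            # Term i of the phrase must sit exactly i positions after the start.
            if all(start + i in index[t][doc_id] for i, t in enumerate(terms)):
                hits.add(doc_id)
                break
    return hits


docs = {
    1: "stanford university is in palo alto",
    2: "university of stanford alumni",
    3: "the palo alto campus of stanford university",
}
index = build_positional_index(docs)
print(phrase_query(index, "stanford university"))
```

Document 2 contains both terms but not next to each other in that order, so it is rejected. A biword index would answer the same two-word query with a single lookup, at the cost of a much larger vocabulary.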
There are two broad classes of search data structures for dictionaries: hashing and search trees.
With hashing, there is no easy way to find minor variants of a query term, because these could be hashed to very different integers. Search trees overcome many of these issues. However, a binary tree must be kept balanced, and rebalancing is costly if the tree changes frequently, for example through insertions and deletions. The B-tree mitigates this by allowing the number of children of each internal node to vary within a fixed interval.
A query such as mon* is known as a trailing wildcard query, because the * symbol occurs only once, at the end of the search string.
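A trailing wildcard query works well with any ordered dictionary, because all terms sharing a prefix sit next to each other. The sketch below uses a sorted Python list and binary search as a stand-in for a search tree; `prefix_range` is a hypothetical helper, and the upper bound is formed by incrementing the last character of the prefix, so mon* becomes the range from mon up to, but not including, moo.

```python
import bisect


def prefix_range(sorted_terms, prefix):
    """Return all terms in a sorted list that start with the given prefix."""
    if not prefix:
        return list(sorted_terms)
    # Smallest string that is greater than every string starting with prefix.
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    lo = bisect.bisect_left(sorted_terms, prefix)
    hi = bisect.bisect_left(sorted_terms, upper)
    return sorted_terms[lo:hi]


terms = sorted(["moan", "monday", "money", "month", "moon", "morning"])
print(prefix_range(terms, "mon"))
```

A hash table cannot answer this query without scanning every key, which is the main reason ordered structures are preferred when wildcard queries must be supported.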
The final technique for tolerant retrieval is phonetic correction: handling misspellings that arise because the user types a query that sounds like the target term. The main idea is to generate, for each term, a “phonetic hash” so that similar-sounding terms hash to the same value.
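Soundex is a well-known family of phonetic hashes. The sketch below follows one simple variant: keep the first letter, map the remaining letters to digits by sound class, collapse repeated digits, drop the zeros that stand for vowels and for H, W, Y, and pad to four characters. Published variants differ in details such as how H and W are treated, so this is an illustration of the idea and not a reference implementation.

```python
# Letter classes for this Soundex variant; "0" marks letters that are dropped.
CODES = {}
for letters, digit in [
    ("AEIOUHWY", "0"), ("BFPV", "1"), ("CGJKQSXZ", "2"),
    ("DT", "3"), ("L", "4"), ("MN", "5"), ("R", "6"),
]:
    for ch in letters:
        CODES[ch] = digit


def soundex(term):
    """Return a 4-character phonetic hash: one letter followed by three digits."""
    letters = [ch for ch in term.upper() if ch in CODES]
    if not letters:
        return ""
    digits = [CODES[ch] for ch in letters]
    # Collapse runs of identical digits into a single digit.
    collapsed = [digits[0]]
    for d in digits[1:]:
        if d != collapsed[-1]:
            collapsed.append(d)
    # Skip the digit of the retained first letter, then remove the zeros.
    tail = [d for d in collapsed[1:] if d != "0"]
    return (letters[0] + "".join(tail) + "000")[:4]


for name in ["Hermann", "Herman", "Robert", "Rupert"]:
    print(name, soundex(name))
```

Terms with the same hash are stored together, so a query for one spelling can retrieve the others.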