
Chinese Spelling Correction Research In Search Engines Based On Statistical Model

Posted on: 2011-04-13
Degree: Master
Type: Thesis
Country: China
Candidate: Z P Chen
GTID: 2178360308460881
Subject: Computer application technology
Abstract/Summary:
Search engines have played an increasingly important role in the development of Web 2.0. As the number of search engine users grows quickly and the demands placed on search engines rise, their functions have been continuously improved. Automatic spelling checking and correction is an important supplementary technology that has been widely applied and promoted.

In a Chinese search engine, automatic spelling checking and correction works as follows: when a user enters keywords, the search engine returns a large number of results that include words similar to the original keywords (for example, a phrase written with homophonous but different characters, or a misspelling), and the results page shows the user the keyword that the system inferred and suggests.

An N-gram statistical language model, built entirely on the analysis of contextual statistical information, is first introduced to the field of Chinese spelling correction in search engines. The model is then analyzed in detail with respect to the characteristics of the Chinese language in order to determine the necessary parameters. On this basis, the language model is optimized so that it comes closer to real language.

TF/IDF weighting is then introduced to score and compare the preliminary checking results, so that better spelling checking and correction results are returned.

All the theoretical models proposed in this thesis were verified on a distributed search engine platform based on Nutch and Hadoop (data sizes from about 100K to 5GB), and the analysis results are presented in charts. The experiments show that, through statistical analysis and comparison, the models achieve good results when the input keywords contain errors, and that the more contextual information is available, the higher the recall and precision of error correction.
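
The N-gram idea summarized above can be illustrated with a minimal sketch. It assumes a character-level bigram model with add-one smoothing and a hand-written homophone confusion table. The abstract does not state the N-gram order, the smoothing method, or how candidates are generated, so the toy corpus, the confusion table, and every function name below are hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter

def train_bigram(queries):
    """Count character unigrams and bigrams over a list of query strings."""
    unigrams, bigrams = Counter(), Counter()
    for q in queries:
        chars = ["<s>"] + list(q) + ["</s>"]
        unigrams.update(chars)
        bigrams.update(zip(chars, chars[1:]))
    return unigrams, bigrams

def log_prob(text, unigrams, bigrams):
    """Log-probability of text under the bigram model (add-one smoothing)."""
    vocab = len(unigrams)
    chars = ["<s>"] + list(text) + ["</s>"]
    return sum(
        math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab))
        for prev, cur in zip(chars, chars[1:])
    )

def candidates(query, confusion):
    """The query itself plus every single-character homophone substitution."""
    result = {query}
    for i, ch in enumerate(query):
        for alt in confusion.get(ch, ()):
            result.add(query[:i] + alt + query[i + 1:])
    return result

def correct(query, confusion, unigrams, bigrams):
    """Return the candidate that the language model scores highest."""
    return max(candidates(query, confusion),
               key=lambda c: log_prob(c, unigrams, bigrams))

# Hypothetical toy data: a tiny query log and a homophone confusion table.
corpus = ["搜索引擎", "搜索引擎优化", "搜索结果"]
confusion = {"所": ["搜"], "索": ["锁"]}
unigrams, bigrams = train_bigram(corpus)

print(correct("所索引擎", confusion, unigrams, bigrams))
```

Because each bigram is conditioned on the preceding character, longer correct context around an error gives the correct candidate more high-probability transitions, which is consistent with the abstract's observation that more contextual information improves recall and precision.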
Keywords/Search Tags: Spelling Correction, N-grams Model, TF/IDF Weight, Search Engine, Distributed Computing