Chinese Spelling Correction Research In Search Engines Based On Statistical Model

Posted on:2011-04-13

Degree:Master

Type:Thesis

Country:China

Candidate:Z P Chen

Full Text:PDF

GTID:2178360308460881

Subject:Computer application technology

Abstract/Summary:

Internet search engines have played an increasingly important role in the development of Web 2.0. As the number of search engine users grows quickly and the demands placed on search engines keep rising, their functions have been continuously improved. Automatic spelling checking and correction is an important auxiliary technology that has been widely applied and promoted.

In a Chinese search engine, automatic spelling checking and correction means that when a user enters keywords to search the Internet, the engine returns results that also cover words similar to the original keywords (for example, a phrase written with homophonic but different characters, or a spelling error), and the results page shows the user the keyword that the system inferred and suggests.

This paper first introduces an N-gram statistical language model, built entirely on the analysis of contextual statistical information, to the field of Chinese spelling correction in search engines. The model is analyzed in detail with respect to the characteristics of the Chinese language in order to determine the necessary parameters. On this basis, the language model is optimized to bring it closer to real language. TF/IDF weighting is then introduced to score and compare the preliminary checking results, so that better spelling checking and correction results are returned.

All the theoretical models proposed in this paper were verified on a distributed search engine platform based on Nutch and Hadoop (data sizes from about 100K to 5 GB), and the analysis results are presented in charts. The statistical analysis and comparison show that the models achieve good results when the input keywords contain errors, and that the more contextual information is available, the higher the recall and precision of error correction.
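The N-gram idea described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes a character-level bigram model with add-one smoothing, a tiny hypothetical query log as training data, and a candidate list that some earlier step (for example, a homophone confusion set) has already produced. The function names `train_bigram`, `log_prob`, and `best_candidate` are illustrative only, and the TF/IDF weighting step is left out because the abstract does not specify how it is applied.

```python
import math
from collections import Counter

def train_bigram(corpus):
    """Count character unigrams and bigrams, with start/end markers per query."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + list(sentence) + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def log_prob(query, unigrams, bigrams):
    """Log-probability of a query under the bigram model (add-one smoothing)."""
    vocab_size = len(unigrams)
    tokens = ["<s>"] + list(query) + ["</s>"]
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        # Unseen bigrams get count 0, so smoothing keeps the probability non-zero.
        total += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
    return total

def best_candidate(candidates, unigrams, bigrams):
    """Return the candidate the language model considers most likely."""
    return max(candidates, key=lambda c: log_prob(c, unigrams, bigrams))

# Hypothetical training data standing in for a real query log.
corpus = ["搜索引擎", "搜索引擎优化", "分布式搜索引擎"]
unigrams, bigrams = train_bigram(corpus)

# The user typed a homophone error; both strings are read "sousuo yinqing".
candidates = ["搜索隐情", "搜索引擎"]
suggestion = best_candidate(candidates, unigrams, bigrams)
print(suggestion)
```

With this toy corpus the model prefers the candidate whose character transitions were observed in training. A real system would train on far more data and would likely use word segmentation and higher-order N-grams.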

Keywords/Search Tags:

Spelling Correction, N-gram Model, TF/IDF Weight, Search Engine, Distributed Computing

Related items

1	Research On Input Error Correction Technology Of Search Engine Based On Statistical Language Model
2	Search Engine Error Correction Algorithm And Error Correction Bad Case Mining
3	N-gram Index Structure For Semantic Based Mathematical Formulas
4	Distributed Based On The Search Engine Irst Improvements
5	Applications Of Spelling Correction Techniques In Information Retrieval And Text Processing
6	Optimization And Implementation Of Chinese Spelling Error Detection And Correction Algorithm
7	A Similar Image Search Engine Based On Millions Of Images And Distributed Computing
8	Efficient Searchable Encryption In Cloud
9	The Research And Design Of Search Engine Based On Distribution
10	Research On Chinese Spelling Correction In Question And Answer System