
Research On Input Error Correction Technology Of Search Engine Based On Statistical Language Model

Posted on: 2018-11-22
Degree: Master
Type: Thesis
Country: China
Candidate: K Qian
Full Text: PDF
GTID: 2348330536977536
Subject: Computer software and theory

Abstract/Summary:
With the rapid development of information technology, search engines on the Internet play an increasingly important role, and the growing number of Internet users place ever higher demands on them. Among a search engine's auxiliary features, input error correction is particularly important and has been widely adopted. The study of search engine error correction technology therefore has far-reaching significance for the development of search engines.

Error correction is an important research topic in natural language processing. Research on error correction for Chinese text began later than that for English. Existing approaches fall into dictionary-based and statistics-based methods. Dictionary-based correction is limited by the size and content of the dictionary, whereas statistics-based methods analyze the internal relationships of the language from a large number of examples and require no special dictionary. Statistical error correction models include those based on mutual-information probability, the N-gram model, and combinations with Chinese-specific decision rules.

This paper presents a method that makes full use of the statistical information of the context. To demonstrate its feasibility, the method was verified experimentally on a distributed search engine platform built with Nutch and Hadoop. This paper mainly completed the following work.

To build a sound search engine platform, this paper first introduces the mainstream indexing mechanism, the inverted index, together with its performance model and compression techniques. The performance of the inverted index is analyzed and compared with that of an ordinary index by calculating the time and space complexity of index creation, which leads to a well-known application of the inverted index: the search engine toolkit Lucene. On top of Lucene, the Nutch search engine is built. Because the experiments require large-scale data, the distributed search engine built with Nutch + Hadoop is introduced in detail.

Given the limitations of existing theoretical research on Chinese, an N-gram language model is established for a Chinese corpus and analyzed in detail. The parameters required by the language model are then determined, and the data-sparsity problem is solved with smoothing techniques. Because, over a large corpus, several corrected keywords produced by the N-gram model may score equally, TF-IDF is used to weight the results of this preliminary processing and filter them to obtain the best result set.
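The correction pipeline described above can be sketched in a few lines of Python. This is a minimal illustration, not the thesis's implementation: the toy English corpus, the candidate queries, and the add-one (Laplace) smoothing choice are all assumptions standing in for the large Chinese corpus, the thesis's smoothing method, and the Nutch + Hadoop platform. It shows only the two scoring stages the abstract names: a smoothed bigram model ranks candidate corrections, and TF-IDF weights break ties among equally probable results.

```python
import math
from collections import Counter

# Toy corpus (hypothetical stand-in for the large Chinese corpus in the thesis).
corpus = [
    "search engine error correction",
    "search engine query log",
    "statistical language model for error correction",
    "distributed search engine platform",
]

# --- Stage 1: bigram language model with add-one (Laplace) smoothing ---
unigrams = Counter()
bigrams = Counter()
for sent in corpus:
    toks = ["<s>"] + sent.split() + ["</s>"]
    unigrams.update(toks)
    bigrams.update(zip(toks, toks[1:]))
V = len(unigrams)  # vocabulary size used by the smoothing term

def sentence_logprob(sentence):
    """Log-probability of a token sequence under the smoothed bigram model.

    Add-one smoothing assigns unseen bigrams a small nonzero probability,
    which is how the data-sparsity problem is handled here.
    """
    toks = ["<s>"] + sentence.split() + ["</s>"]
    lp = 0.0
    for w1, w2 in zip(toks, toks[1:]):
        lp += math.log((bigrams[(w1, w2)] + 1) / (unigrams[w1] + V))
    return lp

# --- Stage 2: TF-IDF weights to rank ties among candidate corrections ---
def tfidf(term):
    """Corpus-level TF-IDF weight of a single term."""
    docs_with = sum(1 for s in corpus if term in s.split())
    tf = sum(s.split().count(term) for s in corpus)
    idf = math.log(len(corpus) / (1 + docs_with))
    return tf * idf

# Hypothetical candidate corrections for a misspelled query.
candidates = ["search engine error correction", "search engine query correction"]
best = max(
    candidates,
    key=lambda c: (sentence_logprob(c), sum(tfidf(t) for t in c.split())),
)
```

The tuple key compares language-model score first and falls back to the summed TF-IDF weight only when the model scores tie, mirroring the abstract's two-stage filtering of the preliminary result set.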
Keywords/Search Tags: Spelling Correction, N-gram Model, TF-IDF Weight, Search Engine, Distributed Computing