Study And Implementation On Chinese Word Segmentation Algorithm Of Search Engine Based On Nutch

Posted on: 2012-09-11, Degree: Master, Type: Thesis
Country: China, Candidate: D Ma, Full Text: PDF
GTID: 2178330335989349, Subject: Computer application technology
Abstract/Summary:
With the rapid development of Internet technology, online information resources are growing exponentially, and extracting valuable information from massive data quickly and accurately has become increasingly important. Search engines address the difficulty users face in retrieving information effectively. For Chinese text, word segmentation determines how accurately a search engine can find information. Using words as the key units of the search engine index greatly improves the accuracy of search results while reducing the computation required during search.

Existing segmentation algorithms fall into three groups: string-matching methods, statistics-based methods, and understanding-based methods. This thesis studies the existing segmentation algorithms and dictionary mechanisms and analyzes the advantages and disadvantages of each. Taking into account the frequency distribution of Chinese word lengths, it proposes a matching algorithm based on first-character hashing and the longest word length, together with an improved version of that algorithm. Verification shows that the algorithm considerably reduces time complexity and therefore has practical value.

Finally, building on an analysis of Nutch's built-in tokenization, a Chinese word segmentation plugin is added to Nutch. Verification of the plugin further illustrates the importance of Chinese word segmentation for search engines.
Keywords/Search Tags: Nutch, Search engine, Chinese word segmentation, First character hash
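
The abstract does not give the details of the proposed algorithm, so the following is only a minimal sketch of the general idea it names: a dictionary hashed on each word's first character, combined with longest-first matching bounded by the longest word stored under that character. It is plain forward maximum matching, not the thesis's own or improved algorithm. The function names `build_index` and `segment`, the bucket layout, and the sample dictionary are all hypothetical.

```python
from collections import defaultdict


def build_index(words):
    """Hash dictionary words by first character (hypothetical layout).

    Each bucket stores the words starting with that character and the
    length of the longest one, which bounds the matching window later.
    """
    index = defaultdict(lambda: {"max_len": 1, "words": set()})
    for word in words:
        if not word:
            continue
        bucket = index[word[0]]
        bucket["words"].add(word)
        bucket["max_len"] = max(bucket["max_len"], len(word))
    return dict(index)


def segment(text, index):
    """Forward maximum matching using the first-character index."""
    tokens = []
    i = 0
    while i < len(text):
        bucket = index.get(text[i])
        match = text[i]  # fall back to a single character
        if bucket is not None:
            # Never try a window longer than the longest word in this bucket.
            longest = min(bucket["max_len"], len(text) - i)
            for length in range(longest, 1, -1):
                candidate = text[i:i + length]
                if candidate in bucket["words"]:
                    match = candidate
                    break
        tokens.append(match)
        i += len(match)
    return tokens


# Illustrative dictionary only; a real one would be loaded from a word list.
sample_index = build_index(["搜索", "搜索引擎", "中文", "分词", "算法"])
print(segment("中文搜索引擎分词算法", sample_index))
```

Looking up the first character takes one hash probe, and the per-bucket maximum length keeps the number of candidate substrings small, which is the kind of saving the abstract attributes to first-character hashing. Characters with no dictionary entry are emitted as single-character tokens.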