Dictionary Based Chinese Word Segmentation Algorithm And Its Application In Nutch System

Posted on: 2013-02-02
Degree: Master
Type: Thesis
Country: China
Candidate: T Y Wang
Full Text: PDF
GTID: 2248330395959411
Subject: Software engineering
Abstract/Summary:
With the rapid development of digital, networking, and information technology, an era centered on information has arrived, and Chinese information retrieval has become increasingly important. As a foundational topic in Chinese information processing, Chinese word segmentation is receiving growing attention, and segmentation accuracy plays a very important role in Chinese information retrieval. Chinese information retrieval has become the lifeblood of the information society and an important foundation for the development of the knowledge economy, with an inestimable effect on many aspects of social life and on social and economic development. Since the 1990s, computer networks, represented by the Internet, have developed rapidly, and the resulting volume of information is enormous. Many people regard the present as the era of computer popularization, in which computers help people cope with heavy workloads. As the amount of information grows daily and Chinese information retrieval becomes more important, Nutch, a search engine application implemented in Java, has become widely influential. It provides all the tools needed to run one's own search engine: users can build a search engine for their internal network or one for the entire web. It is no exaggeration to say that people's lives, work, study, and communication are now inseparable from search engines, and the Nutch search engine application will hold a place of its own in the search engine field.
This thesis reviews Chinese word segmentation and its current development, and analyzes the three main classes of Chinese word segmentation algorithms. It analyzes and compares, in theory, three forms of dictionary organization: whole-word binary search, the TRIE index tree, and character-by-character binary search. It then proposes a new double-character hash index dictionary mechanism, and segmentation experiments demonstrate the advantages of double-character hashing. Combining the double-character hash index dictionary with the forward maximum matching algorithm yields a dictionary-based Chinese word segmentation algorithm. After analyzing the Nutch segmentation framework and modifying its code, the Chinese word segmentation algorithm is integrated into the Nutch search engine application as a plug-in. Testing shows that the plug-in gives the search engine good Chinese processing ability and thereby improves retrieval efficiency.
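
The combination of a two-character hash index with forward maximum matching can be illustrated with a minimal Python sketch. The abstract does not give the thesis's actual data layout, so everything here is an assumption made for illustration: the class name `DoubleCharHashDict`, the function `forward_max_match`, the choice to key a hash table on the first two characters and store the remaining suffixes in a set, and the tiny word list.

```python
from collections import defaultdict


class DoubleCharHashDict:
    """Toy dictionary indexed by the first two characters of each word.

    Hypothetical layout for illustration; not the thesis's implementation.
    """

    def __init__(self, words):
        self.single = set()             # one-character words
        self.index = defaultdict(set)   # two-char prefix -> remaining suffixes
        self.max_len = 1                # longest word, bounds the match window
        for w in words:
            if not w:
                continue
            self.max_len = max(self.max_len, len(w))
            if len(w) == 1:
                self.single.add(w)
            else:
                # An empty suffix marks that the two-char prefix is itself a word.
                self.index[w[:2]].add(w[2:])

    def contains(self, word):
        if len(word) == 1:
            return word in self.single
        # .get avoids creating empty entries in the defaultdict on lookups.
        suffixes = self.index.get(word[:2])
        return suffixes is not None and word[2:] in suffixes


def forward_max_match(text, dictionary):
    """Scan left to right, always taking the longest dictionary word."""
    tokens = []
    i = 0
    while i < len(text):
        end = min(len(text), i + dictionary.max_len)
        match = text[i]  # fallback: emit a single character
        # Try the longest candidate first and shrink down to two characters.
        for j in range(end, i + 1, -1):
            if dictionary.contains(text[i:j]):
                match = text[i:j]
                break
        tokens.append(match)
        i += len(match)
    return tokens


words = ["中文", "分词", "中文分词", "算法", "搜索", "引擎", "搜索引擎"]
d = DoubleCharHashDict(words)
print(forward_max_match("中文分词算法", d))    # ['中文分词', '算法']
print(forward_max_match("搜索引擎的算法", d))  # ['搜索引擎', '的', '算法']
```

In this sketch, a lookup first hashes on the two-character prefix, so any candidate whose prefix is absent from the dictionary is rejected in a single step without comparing the rest of the string. A production dictionary would likely store the suffixes in a more compact structure.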
Keywords/Search Tags: Nutch, Chinese word segmentation, double-character hash indexing, maximum matching algorithm