Applied Research Of Chinese Word Segmentation In Agricultural Vertical Search Engine

Posted on: 2014-07-15
Degree: Master
Type: Thesis
Country: China
Candidate: T Bai
GTID: 2268330401454304
Subject: Agricultural mechanization project
Abstract/Summary:
This thesis first analyses the theory, methods, and main open problems of Chinese word segmentation in depth, with a focus on the application of statistical models in natural language processing. On this basis, it proposes a Chinese word segmentation algorithm that combines a dictionary with a statistical language model and is tailored to the requirements of agricultural vertical search. The method builds a segmentation matrix using an improved omni-segmentation algorithm, recognizes all types of ambiguity, and generates a set of coarse segmentation results. An n-gram model then selects the candidate with the highest probability, and the final segmentation is obtained by identifying unknown words with a POS tagging method based on the maximum entropy model. Finally, a prototype Chinese word segmentation system based on this algorithm is designed and implemented.

The proposed method improves on existing approaches in three respects. The first is to identify words that act as distinctive segmentation markers, build a characteristic-word lexicon from a large-scale corpus, and use these characteristic words to pre-segment the pre-processed sentences, which effectively reduces the string length handled in the coarse segmentation stage. The second is to use the improved segmentation model to build a segmentation matrix with POS tagging, which effectively detects ambiguity boundaries, recognizes all types of ambiguity, and filters out all segmentations containing ambiguity; a bigram model then computes the probability of each candidate to choose the best segmentation. The third is to establish thesauri of agricultural professional terms, Chinese names, and Chinese organization names, choose appropriate feature templates, generate sample data, and then identify unknown words with the maximum entropy model.

Three experiments were designed. The first compares the performance of the improved omni-word segmentation with that of traditional omni-word segmentation. The second compares the unknown-word recognition rate of the maximum entropy model under different context windows with a 4-tag POS tagging set. The third compares the overall performance of ICTCLAS, Paoding, IKAnalyzer, and the prototype system. The results show that recall reached 93.6%, precision reached 91.7%, and F1 reached 92.6%, while OOV recall reached 77.2% and OOV precision reached 90.1%.
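
The two core stages described in the abstract, enumerating every dictionary-consistent segmentation and then choosing the most probable one with a bigram model, can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the recursive enumeration stands in for the segmentation-matrix construction, add-one smoothing is an assumed choice, and the dictionary, the counts, and the function names (`all_segmentations`, `bigram_log_prob`, `best_segmentation`) are hypothetical values invented for illustration.

```python
import math
from collections import Counter

# Hypothetical toy dictionary; a real system would load a large lexicon.
DICTIONARY = {"研究", "研究生", "生命", "命", "起源"}

# Hypothetical toy counts, invented purely for illustration.
UNIGRAMS = Counter({"<s>": 10, "研究": 6, "研究生": 2, "生命": 5, "命": 1, "起源": 4})
BIGRAMS = Counter({
    ("<s>", "研究"): 5, ("研究", "生命"): 3, ("生命", "起源"): 3,
    ("<s>", "研究生"): 1, ("命", "起源"): 1,
})


def all_segmentations(text, dictionary):
    """Enumerate every way to split `text` into dictionary words."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in dictionary:
            for rest in all_segmentations(text[end:], dictionary):
                results.append([word] + rest)
    return results


def bigram_log_prob(words, unigrams, bigrams):
    """Log-probability of a word sequence under an add-one-smoothed bigram model."""
    vocab_size = len(unigrams)
    log_prob = 0.0
    previous = "<s>"  # sentence-start marker
    for word in words:
        numerator = bigrams[(previous, word)] + 1
        denominator = unigrams[previous] + vocab_size
        log_prob += math.log(numerator / denominator)
        previous = word
    return log_prob


def best_segmentation(text):
    """Return the highest-scoring candidate, or [] if no segmentation exists."""
    candidates = all_segmentations(text, DICTIONARY)
    return max(
        candidates,
        key=lambda words: bigram_log_prob(words, UNIGRAMS, BIGRAMS),
        default=[],
    )


if __name__ == "__main__":
    print(all_segmentations("研究生命起源", DICTIONARY))
    print(best_segmentation("研究生命起源"))
```

With these toy counts, the ambiguous string "研究生命起源" has two candidate segmentations, and the bigram score favors 研究 / 生命 / 起源 over 研究生 / 命 / 起源. Exhaustive enumeration grows exponentially with string length, which is presumably one reason the abstract stresses shortening the strings before the coarse segmentation stage.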
Keywords/Search Tags: Chinese word segmentation, agricultural vertical search engine, Omni-word Segmentation, bigram, maximum entropy model