
Research On Word-Segmentation Based On Maximum Entropy Model

Posted on: 2008-04-10
Degree: Master
Type: Thesis
Country: China
Candidate: L J Jia
GTID: 2178360215971647
Subject: Computer software and theory
Abstract/Summary:
In recent years, with the development of China's information industry, word segmentation has become a fundamental topic in Chinese information processing. At present, segmentation algorithms can be divided into two kinds: intelligent segmentation and mechanical segmentation. Intelligent segmentation is the rule-based method. It has high complexity and is difficult to implement, so it is still at the experimental stage. Mechanical segmentation has low complexity and is easy to implement, but it often makes mistakes when dealing with ambiguous phrases and unknown-word recognition, and its accuracy and speed are closely related to the size of the lexicon. Unlike Western languages, Chinese has no separator between the words of a sentence. Chinese lexical conventions are also not standardized and change often, which makes word segmentation difficult. Automatic Chinese segmentation is a key task in natural language processing and computational linguistics. It is indispensable because its result directly affects many applications, such as parsing, semantic analysis, speech recognition, machine translation, information retrieval, and information filtering. Automatic Chinese segmentation can be used to recognize Chinese words automatically. Although there have been some research efforts in this field, some problems remain in practical applications and need to be solved by further research.

The maximum entropy approach has proved to be expressive and effective for statistical language modeling. As a statistical method, the maximum entropy framework is used efficiently, and in its applications the accuracy is at or near the state of the art. The model is easy to understand, can control subtle features, and is reusable. Its shortcoming is that training consumes considerable time and space.

In this dissertation, we first introduce the methods of word segmentation, the maximum entropy method, parameter estimation, and feature selection algorithms. Through comparison and analysis of feature selection algorithms, an improved feature selection method is proposed to speed up feature selection. The core work of the paper is the design and implementation of a Chinese segmentation system based on the maximum entropy model. The system includes modules such as preprocessing, model training, named entity recognition, and part-of-speech tagging. Finally, the paper validates the system's performance by experiment. Compared with other segmentation systems, it achieved better segmentation efficiency and accuracy.

In this paper, the major work is as follows:
(1) Research on the principle of the maximum entropy model, parameter estimation, and feature selection algorithms.
(2) Feature selection: the maximum entropy model itself does not address feature selection, because it only determines an appropriate probability model. However, the feature space is large, so choosing representative features with little redundancy is very important. To address this problem, we propose an improved feature selection algorithm.
(3) System construction: a Chinese word segmentation system is built on the maximum entropy model. In building the system, we first recognize named entities and then perform word segmentation. The final results show that this method achieved better segmentation efficiency and accuracy.
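
As a rough illustration of the kind of probability model the abstract refers to, the sketch below evaluates the standard maximum entropy conditional form p(y | x) = exp(sum_i lambda_i * f_i(x, y)) / Z(x) for a single character position. Everything concrete in it is an assumption made for illustration and is not taken from the thesis: the two-tag set ("B" for a character that begins a word, "I" for one that continues a word), the three feature functions, and the weights, which a real system would obtain through parameter estimation.

```python
import math

# Assumed tag set (not from the thesis): B = begins a word, I = inside a word.
TAGS = ("B", "I")

# Each entry pairs a binary feature function f_i(x, y) with a weight lambda_i.
# Both the features and the weights are invented for this sketch.
FEATURES = [
    (lambda x, y: x["prev"] in {"，", "。"} and y == "B", 2.0),
    (lambda x, y: x["prev"] == "中" and x["cur"] == "国" and y == "I", 1.5),
    (lambda x, y: y == "B", 0.3),
]

def maxent_prob(x):
    """Return p(y | x) for every tag y under the maximum entropy form."""
    # Score of each tag: sum of the weights of the features that fire.
    scores = {y: sum(w for f, w in FEATURES if f(x, y)) for y in TAGS}
    # Normalizer Z(x) makes the probabilities sum to one.
    z = sum(math.exp(s) for s in scores.values())
    return {y: math.exp(s) / z for y, s in scores.items()}

# Context: the current character "国" preceded by "中".
probs = maxent_prob({"prev": "中", "cur": "国"})
print(probs)
```

With these made-up weights the model prefers "I" for this context, meaning "国" is attached to the preceding "中". Feature selection, as discussed in item (2), amounts to deciding which entries belong in a list like `FEATURES`.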
Keywords/Search Tags: Word Segmentation, Maximum Entropy Model, Parameter Estimation, Feature Selection, Named Entity Recognition