The Study Of A Chinese Word Segmentation Model Based On Multi-object Optimization

Posted on: 2009-01-22
Degree: Master
Type: Thesis
Country: China
Candidate: W L Wang
GTID: 2178360272465195
Subject: Computer software and theory
Abstract/Summary:
Natural language is inherently fuzzy and complex. At present, no approach solves the Chinese word segmentation problem effectively both in theory and in practice, and existing segmentation still cannot meet the needs of applications such as Chinese search engines, text classification and machine translation.

It is observed that many features are usually combined into a single probability formula, which, from an optimization point of view, forms one objective function. Even though one objective function can draw on many features, it is reasonable to consider several such evaluation functions, each from a different view. The process of solving the Chinese word segmentation problem can therefore be formulated as a multi-objective optimization problem.

The thesis adopts the bigram frequency probability of the sentence, the bigram PoS (part of speech) probability of the sentence and the deviation of the substrings' lengths as the optimization objectives. The results from the three objectives make up a population, on which evolutionary operations such as crossover, mutation and selection can be performed. The selection criterion is based on a partial order with two parts: the Pareto order on the objective vectors, and a cluster distance that keeps the population diverse. Inspired by the fact that natural evolution is shaped by the environment, an artificial environment constraint for word segmentation is also developed to guide the evolution of the population.

A prototype of the new model is implemented. Substrings are tagged with information values that give clues about whether a word is related to a person or place name, about ambiguity positions, and about the certainty of words such as symbols and high-frequency vocabulary words. Many experiments are conducted, including a disambiguation test, a segmentation test and a Chinese name identification test based on multiple features. The results suggest that the multi-objective model may be a promising new approach to Chinese word segmentation.
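
The Pareto order mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names, the choice of standard deviation as the "deviation of substrings' length", the negation of that deviation so that higher is better on every objective, and all score values are assumptions made for illustration. The cluster-distance part of the selection criterion is not shown.

```python
from statistics import pstdev
from typing import List, Sequence, Tuple

# One objective vector per candidate segmentation. Assumed layout:
# (bigram log-probability, PoS bigram log-probability, -length deviation),
# so that a larger value is better on every axis.
Vector = Tuple[float, ...]


def length_deviation(words: Sequence[str]) -> float:
    """Population standard deviation of word lengths (an assumed reading
    of the third objective; the thesis may define it differently)."""
    return pstdev([len(w) for w in words])


def dominates(a: Vector, b: Vector) -> bool:
    """True if a is at least as good as b on every objective and
    strictly better on at least one (higher is better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(population: List[Vector]) -> List[int]:
    """Indices of candidates that no other candidate dominates."""
    return [
        i
        for i, a in enumerate(population)
        if not any(dominates(b, a) for j, b in enumerate(population) if j != i)
    ]


# Hypothetical scores for three candidate segmentations of one sentence.
scores: List[Vector] = [
    (-12.1, -8.3, -0.5),  # candidate 0
    (-11.4, -9.0, -0.8),  # candidate 1
    (-13.0, -9.5, -0.9),  # candidate 2: worse than candidate 0 everywhere
]
front = pareto_front(scores)
print(front)
```

Candidates 0 and 1 each win on a different objective, so neither dominates the other and both stay on the front. Candidate 2 is dominated and would be the first to be discarded by a Pareto-based selection step.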
Keywords/Search Tags: Chinese word segmentation, multi-object, optimization, information value tagging