The Smoothing Technique Based On Mutual Information For Statistical Language Model

Posted on: 2006-05-30
Degree: Master
Type: Thesis
Country: China
Candidate: Y W Huang
GTID: 2168360155972925
Subject: Computer software and theory
Abstract/Summary:
Natural language processing (NLP) is an attractive and challenging field of computer science. Its goal is to build computational models that simulate human language cognition. The intelligence of today's computers, however, remains far behind that of humans, and many factors hold the field back. Data sparseness in the statistical language model (SLM) is one of the problems that NLP still needs to solve. This thesis focuses on the widely used statistical language model: it surveys existing methods for building such models and existing smoothing techniques, and it proposes both a new way of building a model that satisfies the normalization property of probability (probabilities sum to one) and a new smoothing technique based on mutual information (MI). The smoothing technique combines the ideas of mutual information and entropy and applies the theory of nonlinear system optimization. The main contributions are as follows. First, the thesis introduces the probability and information theory underlying statistical language models, then reviews the existing smoothing techniques and analyzes their principles and implementation. Second, the main body of the thesis examines several existing ways of building an SLM. These are hard to use in practice because they do not satisfy the normalization property. The thesis therefore proposes building the SLM by adding the same symbol at both the beginning and the end of each paragraph in the corpus, so that the model satisfies normalization. Third, the thesis proposes a new smoothing technique based on mutual information.
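The boundary-symbol idea can be illustrated with a small bigram sketch. This is only one reading of the abstract, not the thesis's actual construction: the symbol `<b>`, the function names `train_bigram` and `row_sums`, and the choice to count the boundary once per paragraph are all assumptions made for illustration. The sketch checks whether the maximum-likelihood estimates P(w | h) sum to one for every history h, with and without padding.

```python
from collections import Counter, defaultdict

BOUNDARY = "<b>"  # hypothetical boundary symbol, not taken from the thesis


def train_bigram(paragraphs, pad=True):
    """Count unigrams (as histories) and bigrams over tokenized paragraphs."""
    unigrams = Counter()
    bigrams = Counter()
    for tokens in paragraphs:
        seq = [BOUNDARY] + list(tokens) + [BOUNDARY] if pad else list(tokens)
        # Assumption: with padding, the closing boundary is the same symbol as
        # the opening one, so the boundary is counted once per paragraph.
        unigrams.update(seq[:-1] if pad else seq)
        bigrams.update(zip(seq, seq[1:]))
    return unigrams, bigrams


def row_sums(unigrams, bigrams):
    """For each history h, return the sum over w of c(h, w) / c(h)."""
    sums = defaultdict(float)
    for (h, w), c in bigrams.items():
        sums[h] += c / unigrams[h]
    return {h: sums[h] for h in unigrams}


paragraphs = [["a", "b"], ["a", "c"]]
unpadded = row_sums(*train_bigram(paragraphs, pad=False))
padded = row_sums(*train_bigram(paragraphs, pad=True))
print(unpadded)  # paragraph-final words have no successor, so their rows fall short of 1
print(padded)    # every history, including the boundary, has a full distribution
```

In the unpadded case a word that only ends paragraphs has no observed successor, so its conditional distribution does not sum to one; padding gives every token occurrence a successor.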
After analyzing the mutual information of the events in the model, the smoothing method lowers the probabilities of events whose mutual information is above the average mutual information and raises the probabilities of events whose mutual information is below it; for unseen events, it backs off to the lower-order model. The coefficient of the smoothing formula is obtained with nonlinear system optimization by minimizing the perplexity of the model, which is intended to ensure that the method performs well. Finally, the thesis compares the new smoothing technique with existing ones by measuring model perplexity on a test set. The experimental data show the advantage of the method, with a reported perplexity figure below 40%. Once implemented, the algorithm proposed in the thesis will serve as an important functional module in the development of a Chinese automatic word segmentation system. The thesis closes with conclusions and directions for further study.
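The abstract does not give the smoothing formula, so the following sketch covers only the two quantities it names: the mutual information of seen events compared with the average, and test-set perplexity. It assumes bigram events, pointwise mutual information in bits, and an average defined as the probability-weighted mean of the pointwise values; the function names `pmi_table` and `perplexity` are hypothetical.

```python
import math
from collections import Counter


def pmi_table(bigrams):
    """Return pointwise MI (bits) per seen bigram and the weighted average MI."""
    total = sum(bigrams.values())
    left = Counter()   # marginal counts of the first element
    right = Counter()  # marginal counts of the second element
    for (h, w), c in bigrams.items():
        left[h] += c
        right[w] += c
    # PMI(h, w) = log2( p(h, w) / (p(h) * p(w)) ), written with raw counts
    pmi = {
        (h, w): math.log2(c * total / (left[h] * right[w]))
        for (h, w), c in bigrams.items()
    }
    # Assumption: "average mutual information" is the p(h, w)-weighted mean
    avg = sum(c / total * pmi[e] for e, c in bigrams.items())
    return pmi, avg


def perplexity(prob, test_bigrams):
    """Perplexity of a conditional model prob(h, w) on a list of (h, w) pairs."""
    log_sum = sum(math.log2(prob(h, w)) for h, w in test_bigrams)
    return 2 ** (-log_sum / len(test_bigrams))


counts = Counter({("a", "b"): 2, ("a", "c"): 1, ("d", "c"): 1})
pmi, avg = pmi_table(counts)
above = {e for e, v in pmi.items() if v > avg}  # candidates for lowering
below = {e for e, v in pmi.items() if v < avg}  # candidates for raising
print(above, below)
```

Under the scheme described in the abstract, events in `above` would have their probabilities lowered and events in `below` raised, with the size of the adjustment governed by a coefficient chosen to minimize `perplexity` on held-out data; that adjustment step is left out here because the source does not specify it.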
Keywords/Search Tags: Natural language processing, Statistical language model, Sparse data, Smoothing technique, Mutual information, Perplexity