The Research On Chinese Text Classification

Posted on: 2010-04-14  Degree: Master  Type: Thesis
Country: China  Candidate: X H Li  Full Text: PDF
GTID: 2178360275480511  Subject: Computer application technology
Abstract/Summary:
With the advent of the information age, the volume of information on the Internet has grown explosively. How to mine the information that interests users from such a mass of data has become a key question, and manual classification cannot cope with it. Automatic text categorization saves substantial human and financial resources and avoids the long cycles, high cost, and low efficiency of manual classification, so automatic classification by computer has become a key technology for solving these problems. At the same time, Chinese word segmentation is one of the fundamental components of Chinese information processing and is frequently used when processing text for Chinese text classification. At present, research on text categorization focuses mainly on text representation, feature selection, and the improvement of categorization algorithms.
The maximum entropy model is, in essence, a constrained optimization problem. In 1957, E. T. Jaynes put forward the maximum entropy principle as a principle or method applicable to various fields of science and technology, which also carried the concept and principles of information entropy beyond thermodynamics. The maximum entropy model is a general statistical modeling technique. In natural language processing, many problems can be cast as statistical classification problems, so many machine learning methods find applications there. The maximum entropy model has a strong ability to express knowledge and is mathematically well founded; many researchers have applied it to natural language processing and achieved good performance. In recent years it has attracted considerable attention in the field, with applications including part-of-speech tagging, semantic disambiguation, phrase identification, and machine translation.
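To make the "constrained optimization problem" concrete, the standard textbook form of the conditional maximum entropy model can be written as follows. This is a sketch of the usual formulation, not necessarily the exact notation used in the thesis; the symbols are assumptions introduced here: x is a document, y a category, f_i(x, y) the feature functions, \tilde{p} the empirical distribution from the training data, and \lambda_i the Lagrange multipliers. The last line shows one common way the equality constraints are relaxed in inequality maximum entropy models, with widths A_i, B_i > 0 that are likewise assumed here.

```latex
% Maximize conditional entropy subject to feature-expectation constraints
\[
\begin{aligned}
\max_{p}\quad & H(p) = -\sum_{x,y} \tilde{p}(x)\, p(y \mid x) \log p(y \mid x) \\
\text{s.t.}\quad & \sum_{x,y} \tilde{p}(x)\, p(y \mid x)\, f_i(x,y)
                   = \sum_{x,y} \tilde{p}(x,y)\, f_i(x,y), \quad i = 1,\dots,n, \\
                 & \sum_{y} p(y \mid x) = 1 \quad \text{for all } x .
\end{aligned}
\]

% Solving with Lagrange multipliers gives the familiar log-linear form
\[
p_{\lambda}(y \mid x) = \frac{1}{Z_{\lambda}(x)} \exp\Big(\sum_{i=1}^{n} \lambda_i f_i(x,y)\Big),
\qquad
Z_{\lambda}(x) = \sum_{y'} \exp\Big(\sum_{i=1}^{n} \lambda_i f_i(x,y')\Big).
\]

% Inequality variant: each equality constraint is relaxed to a band
\[
-B_i \;\le\; \sum_{x,y} \tilde{p}(x,y)\, f_i(x,y) - \sum_{x,y} \tilde{p}(x)\, p(y \mid x)\, f_i(x,y) \;\le\; A_i,
\qquad A_i, B_i > 0 .
\]
```

The multipliers \lambda_i have no closed-form solution in general, which is why iterative training algorithms are needed to fit the model.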
This study analyzes the maximum entropy model and the inequality maximum entropy model, and investigates feature generation methods and feature selection algorithms for Chinese text classification. First, we analyze statistical language models, the maximum entropy model, smoothing techniques, and iterative training algorithms. We then discuss the problems that arise when maximum entropy is used to classify Chinese text and, based on an in-depth study of maximum entropy theory, introduce the inequality maximum entropy model and apply it to text classification. Next, we discuss a drawback of existing feature selection methods: they generate too many features, which increases running time and lowers accuracy. To address this, we propose combining information gain, mutual information, and the chi-square statistic in a reasonable way to select features, thereby achieving dimensionality reduction. Experimental results show that the proposed methods are effective for the inequality maximum entropy model and are also readily extensible. Finally, we examine how the original features are obtained from Chinese text, that is, how to select the feature set using an automatic non-dictionary segmentation mechanism. We review some typical Chinese word segmentation algorithms and, taking the particular characteristics of Chinese into account, study non-dictionary word segmentation; an improved Chinese word segmentation algorithm is presented and applied, and experiments demonstrate its high efficiency. The thesis concludes with a summary of its main content and an outlook on the future of text categorization techniques.
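The abstract does not say how the three feature selection criteria are combined, so the following is only a minimal sketch under stated assumptions. It computes information gain, (pointwise) mutual information, and the chi-square statistic from document counts using their standard definitions, then combines them by averaging min-max normalized scores. The combination rule and the names `score_terms` and `select_features` are hypothetical choices made for illustration, not the thesis's method. Documents are assumed to be already segmented into words.

```python
import math
from collections import Counter, defaultdict


def entropy(counts):
    """Shannon entropy (in bits) of a list of non-negative counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def score_terms(docs, labels):
    """Return {term: (ig, mi, chi2)} for tokenized docs and their labels.

    IG is computed over all classes at once; MI and chi-square are computed
    per class and the maximum over classes is kept.
    """
    n = len(docs)
    class_count = Counter(labels)
    df = defaultdict(Counter)  # df[term][label] = docs with that label containing term
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            df[term][label] += 1

    classes = sorted(class_count)
    h_class = entropy([class_count[c] for c in classes])
    scores = {}
    for term, per_class in df.items():
        n_t = sum(per_class.values())  # number of docs containing the term
        with_t = [per_class[c] for c in classes]
        without_t = [class_count[c] - per_class[c] for c in classes]
        ig = (h_class
              - (n_t / n) * entropy(with_t)
              - ((n - n_t) / n) * entropy(without_t))

        best_mi, best_chi2 = float("-inf"), 0.0
        for c in classes:
            a = per_class[c]           # in class, contains term
            b = n_t - a                # other classes, contains term
            c_ = class_count[c] - a    # in class, lacks term
            d = n - a - b - c_         # other classes, lacks term
            if a > 0:
                best_mi = max(best_mi, math.log2(a * n / ((a + c_) * (a + b))))
            denom = (a + c_) * (b + d) * (a + b) * (c_ + d)
            if denom > 0:
                best_chi2 = max(best_chi2, n * (a * d - b * c_) ** 2 / denom)
        scores[term] = (ig, best_mi, best_chi2)
    return scores


def select_features(docs, labels, k):
    """Keep the k terms with the highest mean of the min-max normalized scores.

    Averaging normalized scores is an assumed combination rule.
    """
    scores = score_terms(docs, labels)
    terms = sorted(scores)
    combined = dict.fromkeys(terms, 0.0)
    for i in range(3):
        values = [scores[t][i] for t in terms]
        lo, hi = min(values), max(values)
        span = (hi - lo) or 1.0  # avoid division by zero when all scores are equal
        for t in terms:
            combined[t] += (scores[t][i] - lo) / span / 3
    return sorted(terms, key=lambda t: combined[t], reverse=True)[:k]


# Toy corpus: already-segmented documents with one label each.
docs = [["足球", "比赛", "的"], ["足球", "球员", "的"],
        ["股票", "市场", "的"], ["股票", "基金", "的"]]
labels = ["sports", "sports", "finance", "finance"]
top_terms = select_features(docs, labels, 2)
```

One reason to combine criteria rather than rely on a single one: pointwise mutual information tends to favor rare terms (in the toy corpus a word seen in a single document scores as high as one seen in every document of its class), whereas information gain and chi-square take the term's absence into account as well.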
Keywords/Search Tags: Chinese text categorization, word segmentation, feature selection algorithm, maximum entropy model, inequality maximum entropy model