
Chinese Text Classification Algorithm

Posted on: 2013-01-05 | Degree: Master | Type: Thesis
Country: China | Candidate: P F Ma | Full Text: PDF
GTID: 2218330371957250 | Subject: Control theory and control engineering
Abstract/Summary:
Decades of progress in information technology and networking have made communication far more convenient and have greatly advanced human civilization. That progress has also brought problems, such as the explosive growth of information and the spread of harmful content. How to manage data effectively and classify it quickly has become a problem that information science urgently needs to address. Text classification has matured into an independent discipline of great practical value and is widely applied in fields such as information retrieval, search engines, and public opinion analysis.

Text classification is difficult because the vector space model of text is high-dimensional and highly sparse. Information gain is the most commonly used feature selection method, but it performs poorly on unbalanced datasets. The support vector machine (SVM) is the method best suited to text classification, but it still has many problems, such as computational complexity, long training time, and high sensitivity to parameters, which make it difficult to apply. To address these problems, this thesis does the following:

It summarizes and analyzes the research background of text classification and related technologies, and studies the basic theory of feature selection methods and the SVM. Because information gain ignores how features are distributed, it performs poorly on unbalanced datasets; this thesis therefore defines within-class and between-class distribution information using Theil entropy and proposes a new information gain method based on Theil entropy, named T-IG. To reduce the SVM's sensitivity to its parameters, this thesis proposes a new classification algorithm, GLOA-SVM, which combines the SVM with the GLOA optimization algorithm, and demonstrates its effectiveness.

Finally, this thesis designs and implements a prototype Chinese text classification system based on T-IG and GLOA-SVM, and demonstrates the effectiveness of the new algorithms for Chinese text classification.
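
As background for the feature selection discussion, the two standard quantities the abstract builds on can be sketched in a few lines of Python. This is a minimal illustration of textbook information gain and of the Theil T index, not the T-IG method itself, whose exact formula is not given in the abstract. The function names and the per-class document counts used as input are assumptions made for this example.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs_with_term, docs_per_class):
    """Textbook information gain of one term.

    docs_with_term[i]  -- documents of class i that contain the term
    docs_per_class[i]  -- all documents of class i
    IG = H(C) - P(t) * H(C | t) - P(not t) * H(C | not t)
    """
    docs_without_term = [n - k for k, n in zip(docs_with_term, docs_per_class)]
    total = sum(docs_per_class)
    p_term = sum(docs_with_term) / total
    return (entropy(docs_per_class)
            - p_term * entropy(docs_with_term)
            - (1 - p_term) * entropy(docs_without_term))


def theil_index(values):
    """Theil T index of non-negative values: 0 when all are equal,
    ln(n) when everything is concentrated in a single entry."""
    n = len(values)
    mean = sum(values) / n
    if mean == 0:
        return 0.0
    return sum((x / mean) * math.log(x / mean) for x in values if x > 0) / n


if __name__ == "__main__":
    # Hypothetical counts: a term found only in class 0 of a balanced corpus.
    print(information_gain([50, 0], [50, 50]))
    # Hypothetical term frequencies across four classes.
    print(theil_index([4, 0, 0, 0]))
```

A term that occurs in every class at the same rate has an information gain of zero, while a term confined to one class scores highest. The Theil index measures how unevenly a quantity, such as a term's frequency, is spread across groups, which is the kind of distribution information the abstract says T-IG adds.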
Keywords/Search Tags: Chinese text classification, support vector machine, feature selection, information gain