The Research Of Text Classification Based On Feature Selection And Topic Model

Posted on: 2014-02-10
Degree: Master
Type: Thesis
Country: China
Candidate: L Xiang
GTID: 2248330398979212
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of global informatization and the spread of the Internet, this new channel of information transmission has changed the way people live, but it has also brought an explosion in the volume of data that carries that information. Organizing and mining this data effectively, so that users can find the information they need quickly, accurately, and comprehensively, is a major challenge in information science today. Text mining was proposed to meet this challenge; it mainly includes text classification, text clustering, and document summarization. Text classification is an important means of acquiring knowledge and understanding things, and it plays a key role in natural language processing and analysis, machine learning, topic identification, and other fields, so automatic text classification based on text content has become a research hot spot.
Text classification is the process of assigning each document in a collection to one of a set of predefined categories. It generally includes text preprocessing, feature selection, classifier selection and training, and evaluation of and feedback on the results. Among these steps, the outcome of feature selection has a significant effect on the classification result, so feature selection has itself become one of the hot topics in text classification research. Feature selection reduces the dimensionality of the text feature space by removing features that carry no meaning for classification, with the ultimate goal of improving classification performance. Traditional feature selection methods are based on mathematical statistics and ignore the semantic relationships between the terms in a text.
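To make the feature selection step concrete, here is a minimal sketch of statistics-based selection using the standard CHI (chi-square) score computed from a 2x2 term/category contingency table. The toy corpus, the whitespace tokenizer, the max-over-categories aggregation, and the function names `chi_square` and `select_features` are assumptions made for illustration; they are not taken from the thesis.

```python
from collections import Counter

def chi_square(a, b, c, d):
    """CHI statistic for one (term, category) pair from a 2x2 table.

    a: docs in the category that contain the term
    b: docs outside the category that contain the term
    c: docs in the category without the term
    d: docs outside the category without the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

def select_features(docs, labels, k):
    """Keep the k terms with the highest max-over-categories CHI score."""
    # Assumed preprocessing: lowercase and split on whitespace.
    doc_terms = [set(doc.lower().split()) for doc in docs]
    vocab = set().union(*doc_terms)
    cat_sizes = Counter(labels)
    n_docs = len(docs)
    scores = {}
    for term in vocab:
        # Number of documents containing the term, per category.
        in_cat = Counter(
            lab for terms, lab in zip(doc_terms, labels) if term in terms
        )
        df = sum(in_cat.values())
        best = 0.0
        for cat, size in cat_sizes.items():
            a = in_cat[cat]
            b = df - a
            c = size - a
            d = n_docs - size - b
            best = max(best, chi_square(a, b, c, d))
        scores[term] = best
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k], scores

docs = [
    "the team won the match",
    "the striker scored in the match",
    "the market fell as stocks dropped",
    "the bank raised rates and stocks rose",
]
labels = ["sport", "sport", "finance", "finance"]
selected, scores = select_features(docs, labels, k=2)
print(selected)
```

On this toy corpus the word "the" occurs in every document, so its score is zero and it is dropped, while terms that occur in only one category rank highest. The score uses document counts alone, which is the sense in which such methods ignore semantic relationships between terms.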
This paper attempts to combine semantic information with traditional feature selection methods, so that the classification algorithm draws on both mathematical statistics and semantic information and thereby improves text classification. It introduces the techniques related to text classification and statistical topic models. The research work is as follows: (1) The disadvantages of the CHI and MI feature selection methods are analyzed, and a new feature selection method, FSCM, is proposed on the basis of the two. To address the deficiencies of CHI and MI, FSCM modifies the corresponding parameters of each and fuses the two methods. (2) The disadvantages of traditional methods that incorporate semantic information are analyzed. The LDA model, which extends LSI and pLSI, can uncover the semantic information hidden in documents without the help of external knowledge bases. Based on this property, a method that combines CHI with the LDA model is proposed and applied to the classification algorithm for text classification.
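The abstract does not say which deficiencies of CHI and MI the thesis targets, nor how FSCM modifies and fuses the two scores, so the sketch below does not attempt FSCM itself. It only computes the textbook CHI statistic and pointwise MI from the same contingency counts to illustrate one commonly cited weakness: pointwise MI tends to favor rare terms. The counts and the function names `chi_square` and `pointwise_mi` are hypothetical.

```python
import math

def chi_square(a, b, c, d):
    """CHI statistic from a 2x2 term/category document-count table."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

def pointwise_mi(a, b, c, d):
    """log( P(t, c) / (P(t) * P(c)) ), estimated from document counts.

    a: docs in the category with the term, b: docs outside it with the term,
    c: docs in the category without the term, d: docs outside it without it.
    """
    n = a + b + c + d
    if a == 0:
        return float("-inf")
    return math.log(a * n / ((a + b) * (a + c)))

# Hypothetical corpus: 1000 documents, 100 of them in the category.
rare = (1, 0, 99, 900)       # term seen once, in the category
common = (80, 20, 20, 880)   # term seen in 80 of the 100 category docs

for name, table in (("rare", rare), ("common", common)):
    print(name, round(pointwise_mi(*table), 3), round(chi_square(*table), 3))
```

With these counts, pointwise MI ranks the term seen once above the term that covers most of the category, while CHI ranks them the other way round. Disagreements of this kind are one plausible motivation for adjusting and combining the two scores.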
Keywords/Search Tags: text classification, feature selection, FSCM, LDA model, semantic information