Research And Implementation On Text Information Classification In Big Data

Posted on: 2016-03-26
Degree: Master
Type: Thesis
Country: China
Candidate: Y Zou
GTID: 2348330476955336
Subject: Information and Communication Engineering
Abstract/Summary:
Thanks to the rapid development of internet information technology, computers are now widely used in social life and work, which has led to a huge accumulation of interactive network text. Information resources are increasing exponentially. However, most of this information has little to do with the research problem at hand, and traditional text analysis cannot achieve the expected effect because of the enormous scale of the data. Researchers therefore face a challenging problem: how to process and analyze a tremendous dataset so as to obtain high-value-density target data quickly and accurately.
This paper aims to improve the existing feature selection method used in text classification so that the representative feature subset of each category is chosen more accurately, and target information can be obtained from massive redundant data accurately, efficiently and comprehensively.
The main work accomplished in this paper is as follows:
(1) To meet the requirements of feature selection in big data text classification, this paper focuses on analyzing the Chi-square statistic algorithm. Because the traditional Chi-square algorithm places particular stress on low-frequency words, we propose a feature selection method that combines the Chi-square value with the term frequency of the feature in the specified category. We also consider how the distribution of features across different categories affects feature selection: by introducing the concepts of concentricity and discreteness, we add modifying factors to the Chi-square formula.
(2) To improve the effect of text categorization, we choose term frequency–inverse document frequency (TF-IDF) to compute the feature weights, construct the vector space model (VSM) from them, and normalize the feature weights.
(3) We choose the support vector machine (SVM) as the text classification method.
When training the support vector machine classifier, we use 10-fold cross-validation on the training samples to optimize the parameter C and the parameter γ of the radial basis kernel, thus improving the performance of the classifier. To find out whether the improved feature selection method improves classification, the improved Chi-square statistic method is applied to the support vector machine.
Based on the above research results, we designed a classification system for Chinese text that relates to communication enterprises and belongs to the policies and regulations category. We implemented the classifier and used the Fudan University corpus to run comparative experiments among the information gain method, the Chi-square statistic method, the HBM method proposed in [50], and the improved feature selection method of this paper. In these experiments, the improved Chi-square statistic method achieves better precision and F1 than the other compared algorithms, which indicates that it classifies more effectively in this system. We also applied the research results to the classification of internet news that relates to communication enterprises and belongs to the policies and regulations category, which further demonstrates the effectiveness, accuracy and practical value of this research.
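The feature-scoring step described in item (1) can be sketched roughly as follows. This is a minimal illustration, assuming the standard Chi-square statistic over a 2x2 term/category document table; the abstract does not give the improved formula or its modifying factors, so the sketch only reports the Chi-square value next to the in-category term frequency that the thesis proposes to combine with it. The function names `chi_square` and `term_stats` and the toy documents are hypothetical.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Standard Chi-square statistic for one term and one category.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def term_stats(docs, labels, category):
    """Map each term to (chi-square value, term frequency inside the category).

    docs is a list of token lists; labels holds one category name per document.
    How the two numbers are combined is not specified in the abstract,
    so they are returned separately.
    """
    n_in = sum(1 for label in labels if label == category)
    n_out = len(labels) - n_in
    df_in, df_out, tf_in = Counter(), Counter(), Counter()
    for tokens, label in zip(docs, labels):
        if label == category:
            df_in.update(set(tokens))   # document frequency inside the category
            tf_in.update(tokens)        # raw term frequency inside the category
        else:
            df_out.update(set(tokens))  # document frequency outside the category
    stats = {}
    for term in set(df_in) | set(df_out):
        a, b = df_in[term], df_out[term]
        stats[term] = (chi_square(a, b, n_in - a, n_out - b), tf_in[term])
    return stats


# Toy corpus (invented for illustration only).
docs = [
    ["network", "policy", "policy"],
    ["policy", "regulation"],
    ["football", "match"],
    ["match", "score", "network"],
]
labels = ["policy", "policy", "sports", "sports"]
stats = term_stats(docs, labels, "policy")
```

In this toy corpus, "football" never occurs in the "policy" category yet receives the same Chi-square value as "regulation", because the statistic measures dependence in either direction; the in-category term frequency (0 versus 1) is the kind of extra signal the thesis adds to separate such cases.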
Keywords/Search Tags: Feature Selection, CHI, TF-IDF, Support Vector Machine, Text Classification