Research And Implementation On Text Information Classification In Big Data

Posted on:2016-03-26

Degree:Master

Type:Thesis

Country:China

Candidate:Y Zou

Full Text:PDF

GTID:2348330476955336

Subject:Information and Communication Engineering

Abstract/Summary:

Thanks to the rapid development of Internet information technology, computers are now widely used in social life and work, producing a huge accumulation of interactive network text. Information resources are growing exponentially, yet most of this information has little relevance to actual research problems. Traditional text analysis cannot achieve the expected results at such an enormous data scale. Researchers therefore face the challenging problem of extracting high value-density target data from massive datasets quickly and accurately.

This paper aims to improve the existing feature selection method used in text classification so that the representative feature subset of each category is chosen more accurately, allowing target information to be obtained from massive redundant data accurately, efficiently and comprehensively.

The main work accomplished in this paper is as follows:

(1) To meet the requirements of feature selection in big data text classification, this paper analyzes the chi-square statistic (CHI) algorithm. Because the traditional chi-square algorithm places too much weight on low-frequency words, a feature selection method is proposed that combines the chi-square value with the term frequency of the feature in the specified category. The distribution of features across categories also affects feature selection, so modifying factors based on the concepts of concentration and dispersion are added to the chi-square formula.

(2) To improve text classification performance, term frequency-inverse document frequency (TF-IDF) is used to compute the feature weights, which are normalized to construct the vector space model (VSM).

(3) The support vector machine (SVM) is chosen as the text classification method.
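As a rough illustration of the building blocks named in items (1) and (2), the sketch below computes the standard chi-square statistic for one term and one category, and TF-IDF weights with L2 normalization. It is a minimal sketch under stated assumptions, not the method of this paper: the modifying factors (term frequency, concentration, dispersion) are not given in the abstract and are not reproduced, and the function names `chi_square` and `tfidf_vectors`, the contingency counts `a`, `b`, `c`, `d`, and the unsmoothed `log(N / df)` form of IDF are choices made here for illustration.

```python
import math
from collections import Counter


def chi_square(a, b, c, d):
    """Standard chi-square statistic for one term and one category.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that do not contain the term
    d: documents outside the category that do not contain the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # Degenerate table (an empty row or column): no evidence either way.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def tfidf_vectors(docs):
    """Return one L2-normalized {term: weight} dict per tokenized document.

    Assumes tf = count / document length and idf = log(N / df),
    which is one common variant among several.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        weights = {
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append(
            {term: (w / norm if norm else 0.0) for term, w in weights.items()}
        )
    return vectors


if __name__ == "__main__":
    # Hypothetical toy corpus of already segmented documents.
    docs = [["policy", "network"], ["policy", "price"]]
    print(chi_square(10, 0, 0, 10))
    print(tfidf_vectors(docs))
```

A term that appears in every document receives an IDF of zero under this variant, so it contributes nothing to the document vector; terms concentrated in one category receive a high chi-square score.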
When training the SVM classifier, we apply 10-fold cross-validation to the training samples to optimize the parameter C and the radial basis function (RBF) kernel parameter, thereby improving the performance of the classifier. To determine whether the improved feature selection method improves classification, the improved chi-square statistic method is applied to the SVM. Based on these results, we designed a classification system for Chinese text related to communication enterprises in the policies and regulations category.

We implemented the classifier and used the Fudan University corpus for comparative experiments among the information gain method, the chi-square statistic method, the HBM method proposed in [50], and the improved feature selection method proposed here. In these experiments, the improved chi-square statistic method achieves better precision and F1 scores than the other compared algorithms in this system. We also applied the results of this paper to the classification of Internet news related to communication enterprises in the policies and regulations category, which further demonstrates the effectiveness, accuracy and practical value of this research.
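The precision and F1 measures used in the comparison can be computed as in the following sketch. It assumes a one-vs-rest view of a single category; the function name `precision_recall_f1` and the label strings are hypothetical, and the abstract does not say whether the reported scores are macro- or micro-averaged across categories.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 for one category treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against empty denominators when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical gold labels and predictions for four documents.
    y_true = ["policy", "policy", "other", "other"]
    y_pred = ["policy", "other", "policy", "other"]
    print(precision_recall_f1(y_true, y_pred, "policy"))
```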

Keywords/Search Tags:

Feature Selection, CHI, TF-IDF, Support Vector Machine, Text Classification

Related items

1	Research On Text Classification System Based On Support Vector Machine
2	The Design And Application Of SSVM's Text Classification Based On Feature Selection Optimization
3	Research On Web Text Classification Based On Support Vector Machines
4	Research On Chinese Text Classification System Based On Support Vector Machine
5	Study On Text Classification Based On Rough Set And Support Vector Machine
6	Research On Text Classification Method Based On Support Vector Machine
7	Research On Text Classification Based On Feature Selection And Its Application
8	Text Classification Based On Machine Learning
9	Research On Text Emotion Classification Based On Improved Feature Selection Method
10	Research And Implementation On Text Information Classification In Big Data