Research And Application Of Enterprise Information Automatic Classification System Based On Text Mining

Posted on: 2017-05-02
Degree: Master
Type: Thesis
Country: China
Candidate: Z Y Wu
GTID: 2308330485469647
Subject: Software engineering
Abstract/Summary:
With the rapid development of the Internet, enterprises face the problem of collecting and processing large amounts of unstructured information. Information classification is an important management tool, but traditional manual classification is both labor-intensive and inefficient. This thesis proposes an automatic classification method based on text mining to make enterprise information classification more efficient.
After a study of a variety of text categorization methods, Support Vector Machine (SVM) is chosen as the main algorithm for information classification. Information gathered from the network tends to have an imbalanced sample distribution. In addition, because SVM classifiers perform poorly on samples near the separating hyperplane, the SVM is supplemented with the K-nearest-neighbor (KNN) algorithm, which classifies such a sample from its K nearest samples rather than from a single one, improving the overall classification results.
First, because enterprise information is unstructured, it is pre-processed through steps such as word segmentation and stop-word removal. Term frequency, document frequency, and similar statistics are then computed from the results. To account for the imbalance of enterprise information gathered from the network, Information Gain is adopted as the feature selection method. Dispersion and concentration parameters, which reflect a term's ability to represent a category, are introduced to reduce the dimensionality of the feature list. The feature vectors of enterprise information are then constructed from the feature words that benefit classification. With the penalty factor C and the kernel function parameter at their default values, experiments on four commonly used kernel functions confirm the choice of the RBF kernel. Grid search with five-fold cross-validation is then used to find the optimal kernel parameter G.
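
The Information Gain step described above can be illustrated with a minimal sketch. It assumes the standard definition IG(t) = H(C) - P(t)H(C|t) - P(¬t)H(C|¬t) over documents that have already been segmented into term sets. The function names, the toy documents, and the category labels are hypothetical, and the dispersion and concentration parameters are left out because the abstract does not define them.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a class distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels, term):
    """IG(t) = H(C) - P(t) * H(C | t) - P(not t) * H(C | not t)."""
    with_term = Counter()
    without_term = Counter()
    for tokens, label in zip(docs, labels):
        if term in tokens:
            with_term[label] += 1
        else:
            without_term[label] += 1

    n = len(docs)
    gain = entropy(list(Counter(labels).values()))
    for part in (with_term, without_term):
        size = sum(part.values())
        if size:  # skip an empty side of the split
            gain -= (size / n) * entropy(list(part.values()))
    return gain


def select_features(docs, labels, k):
    """Return the k terms with the highest information gain."""
    vocab = sorted({t for tokens in docs for t in tokens})
    ranked = sorted(vocab, key=lambda t: information_gain(docs, labels, t),
                    reverse=True)
    return ranked[:k]


# Hypothetical toy corpus: each document is a set of segmented terms
# with stop words already removed.
docs = [
    {"loan", "bank", "credit"},
    {"bank", "interest", "rate"},
    {"steel", "factory", "output"},
    {"factory", "steel", "export"},
]
labels = ["finance", "finance", "manufacturing", "manufacturing"]

top_terms = select_features(docs, labels, 3)
```

Terms that appear in every document of one category and in none of the other, such as "bank" here, reach the maximum gain for a balanced two-class corpus, while a term seen in only one document scores lower.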
On this basis, the SVM information classifier is obtained by training. The support vectors of the SVM classifier serve as the training samples of the KNN classifier. Because the enterprise information obtained may be imbalanced, the KNN classifier introduces a weighting factor to adjust the weights between categories, and its K value is determined experimentally. For the combination of the SVM and KNN classifiers, the threshold θ is also determined by experiment. When a sample lies close to the SVM hyperplane, the SVM-KNN classification model classifies it with the weighted KNN classifier built on the support vectors; when the sample lies far from the hyperplane, the SVM classifier's result is used directly.
Information classification experiments were conducted at a large enterprise in a specific industry. The experiments, based on a large number of enterprises, verify the effectiveness of the SVM-KNN classification model. The model adapts better to imbalanced enterprise information samples, which makes enterprise information classification more accurate.
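
The SVM-KNN decision rule can be sketched as follows. This is an illustration under stated assumptions, not the thesis's implementation: it uses a linear decision function in place of the RBF kernel for brevity, it takes the absolute decision value |f(x)| as the measure of closeness to the hyperplane that is compared with θ, and all names, weights, and sample points are hypothetical.

```python
import math
from collections import defaultdict


def svm_score(x, weights, bias):
    """Signed decision value of a linear SVM: f(x) = w . x + b."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def weighted_knn(x, support_vectors, k, class_weights):
    """Vote among the k nearest support vectors.

    Each vote is scaled by the weight of its class, so a minority
    class can be given more influence.
    """
    nearest = sorted(support_vectors, key=lambda sv: math.dist(x, sv[0]))[:k]
    votes = defaultdict(float)
    for _, label in nearest:
        votes[label] += class_weights[label]
    return max(votes, key=votes.get)


def svm_knn_predict(x, weights, bias, support_vectors, k, class_weights, theta):
    """Use the SVM result far from the hyperplane, weighted KNN near it."""
    score = svm_score(x, weights, bias)
    if abs(score) >= theta:
        return 1 if score > 0 else -1
    return weighted_knn(x, support_vectors, k, class_weights)


# Hypothetical trained model: hyperplane x0 = 0 and four support vectors,
# one from the minority class (+1) and three from the majority class (-1).
weights, bias = (1.0, 0.0), 0.0
support_vectors = [
    ((1.0, 0.0), 1),
    ((-1.0, 0.0), -1),
    ((-1.0, 1.0), -1),
    ((-1.0, -1.0), -1),
]
minority_boost = {1: 3.0, -1: 1.0}   # assumed class weights
no_boost = {1: 1.0, -1: 1.0}

far_positive = svm_knn_predict((2.0, 0.0), weights, bias,
                               support_vectors, 3, minority_boost, 0.5)
far_negative = svm_knn_predict((-2.0, 0.0), weights, bias,
                               support_vectors, 3, minority_boost, 0.5)
near_weighted = svm_knn_predict((0.2, 0.0), weights, bias,
                                support_vectors, 3, minority_boost, 0.5)
near_unweighted = svm_knn_predict((0.2, 0.0), weights, bias,
                                  support_vectors, 3, no_boost, 0.5)
```

In this toy setup the point (0.2, 0.0) falls inside the θ band, so the vote of its three nearest support vectors decides the label. With equal class weights the two majority-class neighbors outvote the single minority-class neighbor, while the assumed weight of 3.0 for the minority class reverses that outcome.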
Keywords/Search Tags: enterprise information, text categorization, Support Vector Machine (SVM)