| With the rapid development of communication technology and the Internet, the volume of information is growing exponentially. Text, the most typical information carrier, is no exception. To manage and retrieve valuable information, research on automatic text categorization has become very important, and text categorization based on the support vector machine (SVM) is an active research field. For information processing at Internet scale, traditional manual classification is no longer adequate, and more effective automatic text categorization methods have therefore emerged. Text categorization helps improve the effectiveness and efficiency of online information retrieval, is important for promoting personalized services and improving models of information access, and is a foundation for content security. SVM is a new statistical learning method proposed by Vapnik. Its learning principle is structural risk minimization, which gives SVM better generalization ability. Great progress has recently been made in the theory and algorithmic implementation of SVM, which has become a new data mining technique for overcoming traditional difficulties such as the curse of dimensionality and over-fitting. In the field of hydropower fault categorization, the traditional practice is to collect and analyze on-site signals, which is difficult to implement and poorly suited to academic research. In recent years, academic work in this field has commonly used expert systems. However, because hydropower faults are highly specialized and their mechanisms are complicated, many of them are still not well understood. As a result, expert systems consume a great deal of manpower and material resources without producing very prominent results. This article applies the support vector machine approach from statistical learning theory to optimize text categorization and applies it to hydropower fault categorization. As far
as possible without lowering accuracy, it tries to improve the efficiency of categorization and reduce manual participation. It introduces text categorization, data mining, and machine learning to replace the field's original predicate reasoning mechanism, which is a new exploration in this field and of great practical significance. Drawing on research into related text categorization technologies, texts are represented as weighted word vectors using the vector space model, TF-IDF weighting, and Chinese word segmentation. Several feature selection methods are used, including document frequency (DF), the chi-square statistic (CHI), information gain (IG), and mutual information (MI). Because the texts are represented as vectors, support vector machines can then be used to categorize them. A Chinese text classifier is designed, in which ICTCLAS segments a large number of texts into words for categorization. A classification algorithm for Chinese text is proposed: thresholds are established for the different categories, and a high-frequency dictionary and a stop-word dictionary are built to improve classification accuracy, giving the algorithm good application value. Statistical learning theory and support vector machine technology are then studied, and the state of research and application of SVM and the problems it faces are presented. In particular, the choice of kernel function in SVM is studied and compared in detail, and it is found to play a crucial role in categorization. SVM solves the non-linearly separable problem as follows: a mapping implicitly defined by the kernel function transfers the samples from the original feature space into a higher-dimensional feature space. Therefore, the
non-linearly separable problem becomes a linearly separable one. When solving for the decision function, the computation can be carried out in the original feature space, which avoids the much higher computational cost of working in the higher-dimensional feature space. In addition, research on multi-class support vector machine algorithms shows that the application fields of SVM can be greatly extended: as long as the samples can be expressed as vectors, one-versus-rest support vector machines can be used for multi-class classification. This article applies the multi-class SVM algorithm to hydropower fault classification. This new attempt not only can greatly reduce the cost of human intervention but also lays a solid theoretical foundation for applications in other areas. The LIBSVM package, developed by Dr. Chih-Jen Lin (Lin Zhiren) of National Taiwan University, is studied and applied to the implementation of hydropower fault text categorization. LIBSVM provides four commonly used kernel options (linear, polynomial, RBF, and sigmoid), so it can address such problems more effectively. Because it is open, it has also attracted scholars all over the world to extend it. This paper selects SVM.NET 1.4, a .NET version of LIBSVM by Matthew Alastair Johnson, as the core of the algorithm; it can accurately simulate a variety of support vector machine algorithms and lays a good foundation for practical application. This article sorts the hydropower faults into five categories: power plant system faults, generator control loop faults, bus transmission line faults, DC system faults, and main transformer system faults, about 900 faults in total. The faults are described in text form, which basically satisfies the experimental condition of a large sample. Measured by precision, recall, and their macro- and micro-average
evaluations, the precision of this hydropower fault text classifier exceeds 90% in every case with the linear, polynomial, and RBF kernels, which is sufficient to meet the needs of the application. The experimental data also show that the RBF kernel achieves better precision, recall, macro-average, and micro-average results than the other two kernels. Finally, a hydropower fault text classifier based on support vector machines is designed and implemented. It can import training and test texts of different categories, build different classification models, and compute precision and recall for each category. In addition, through the integrated SVM.NET, it can plot samples as points, offering good visualization and ease of operation, and it has good prospects. It can be applied not only to hydropower fault text categorization but is also expected to find wide application in other related fields. |
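
As an illustration of the macro- and micro-average evaluations mentioned in the abstract, the following minimal Python sketch computes both averages of precision and recall for a single-label, multi-class classifier. It is not the code of the classifier described here (which is built on SVM.NET); the function names, the category labels, and the sample predictions are hypothetical and chosen only for illustration.

```python
from collections import Counter


def per_class_counts(y_true, y_pred):
    """Count true positives, false positives and false negatives per class."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1  # the predicted class gains a false positive
            fn[truth] += 1  # the true class gains a false negative
    return tp, fp, fn


def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def macro_micro(y_true, y_pred):
    """Return macro- and micro-averaged precision and recall."""
    tp, fp, fn = per_class_counts(y_true, y_pred)
    labels = sorted(set(y_true) | set(y_pred))

    # Macro-average: compute the metric per class, then take the plain mean.
    precisions = [safe_div(tp[c], tp[c] + fp[c]) for c in labels]
    recalls = [safe_div(tp[c], tp[c] + fn[c]) for c in labels]

    # Micro-average: pool the counts over all classes, then compute once.
    total_tp = sum(tp.values())
    total_fp = sum(fp.values())
    total_fn = sum(fn.values())

    return {
        "macro_precision": sum(precisions) / len(labels),
        "macro_recall": sum(recalls) / len(labels),
        "micro_precision": safe_div(total_tp, total_tp + total_fp),
        "micro_recall": safe_div(total_tp, total_tp + total_fn),
    }


if __name__ == "__main__":
    # Hypothetical labels; real data would come from the fault text corpus.
    y_true = ["generator", "generator", "bus", "bus", "dc", "dc"]
    y_pred = ["generator", "bus", "bus", "bus", "dc", "generator"]
    for name, value in macro_micro(y_true, y_pred).items():
        print(f"{name}: {value:.4f}")
```

Macro-averaging weights every category equally, so a small category counts as much as a large one. Micro-averaging weights every document equally. When each document has exactly one label, micro-averaged precision and recall are equal, because every misclassification is one false positive and one false negative.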