| As the backbone of China’s transportation, the railway plays an important role in the development of the national economy and people’s livelihood, and safety is the premise of the orderly and stable operation of the railway system. With the rapid development of railway technology, all kinds of new equipment are continuously put into operation, and new problems arise: which types of railway equipment have higher failure rates, how to describe different railway equipment faults in a structured way, and how to use fault descriptions to mine their underlying patterns. To solve these problems, a text classification method for railway equipment inertia faults is needed to identify and classify the massive volume of railway fault text. This paper starts from the source of the fault text and expands the word segmentation thesaurus before text vectorization: equipment names, quality standards, and station names related to the railway system are obtained from authoritative websites such as those of the National Railway Corporation and China Railway Inspection and Certification Center Co., Ltd., and are used to generate a special thesaurus for the railway equipment field. Using Jieba word segmentation combined with this thesaurus, the equipment fault descriptions are segmented and stop words are removed, so that the resulting segmented text is closer to the result of manual processing. The Word2vec algorithm is then used to vectorize the segmented text and obtain word vectors that represent the fault text, after which the LDA algorithm is used to extract features from the generated text vectors, providing the data source for the subsequent classification algorithms. Next, single-classifier models such as decision tree, KNN, support vector machine, and gradient boosting decision tree are built on the processed data set, with overall accuracy, recall, and F1 score used as the evaluation criteria of classification performance. Then, following the combination rules of ensemble classifiers, each single classifier is used as a base classifier in stacking ensemble learning, with a decision tree as the meta-classifier. Because the data set used in this paper is strongly imbalanced, the Borderline-SMOTE algorithm is used to expand the minority classes, each base classifier is weighted by the ratio of its classification accuracy on the minority classes to its overall classification accuracy, and a railway fault text classification model based on weighted stacking ensemble learning is established. The results show that the railway-specific word segmentation thesaurus can effectively represent the semantics of the original text, with the cosine similarity and the Pearson correlation coefficient both reaching nearly 0.9. Analysis of the experimental results shows that the weighted stacking ensemble learning model effectively improves the accuracy on minority classes. Compared with the single classifiers, its overall performance is greatly improved, and compared with the traditional stacking model, the evaluation metrics are also improved. |
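
The base-classifier weighting described above can be illustrated with a small sketch. The abstract does not state the exact formula, so the code below assumes that the raw weight of a base classifier is its accuracy on minority-class samples divided by its overall accuracy, and that the raw weights are then normalized to sum to 1. The function names `minority_weight` and `normalized_weights`, the normalization step, and the toy labels are hypothetical choices for illustration and are not taken from the paper.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty sample")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def minority_weight(y_true, y_pred, minority_classes):
    """Raw weight: accuracy on minority-class samples / overall accuracy.

    Assumption: this ratio is one plausible reading of the weighting rule.
    """
    overall = accuracy(y_true, y_pred)
    if overall == 0:
        raise ValueError("overall accuracy is zero; the ratio is undefined")
    # Keep only the samples whose true label belongs to a minority class.
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t in minority_classes]
    minority = accuracy([t for t, _ in pairs], [p for _, p in pairs])
    return minority / overall


def normalized_weights(y_true, predictions, minority_classes):
    """Normalize the raw weights of all base classifiers so they sum to 1.

    `predictions` maps a base-classifier name to its predicted labels.
    Assumption: normalization is added here for convenience.
    """
    raw = {
        name: minority_weight(y_true, y_pred, minority_classes)
        for name, y_pred in predictions.items()
    }
    total = sum(raw.values())
    if total == 0:
        raise ValueError("no base classifier is correct on any minority sample")
    return {name: w / total for name, w in raw.items()}


if __name__ == "__main__":
    # Toy labels (hypothetical): class "B" is the minority class.
    y_true = ["A", "A", "A", "A", "A", "A", "B", "B"]
    predictions = {
        "tree": ["A", "A", "A", "A", "A", "A", "A", "B"],
        "knn": ["A", "A", "A", "A", "B", "B", "B", "B"],
    }
    weights = normalized_weights(y_true, predictions, {"B"})
    for name, w in weights.items():
        print(name, round(w, 3))
```

In this toy case the classifier that recovers both minority samples receives the larger weight even though its overall accuracy is lower, which is the intended effect of weighting toward minority-class performance. How the weights are then applied to the base-classifier outputs before they reach the meta-classifier is not specified in the abstract and is left open here.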