
Research On Analysis And Mining Method Of Railway System Fault Text Data Based On Machine Learning

Posted on: 2021-02-23
Degree: Master
Type: Thesis
Country: China
Candidate: J Yuan
GTID: 2392330614472048
Subject: Transportation planning and management
Abstract/Summary:
With the introduction of machine learning technology, the level of informatization in railway system fault management has improved further. Current railway system fault information databases store a large amount of text data about faults. These data record the time the fault occurred, the vehicle number, the system to which the fault belongs, the nature of the fault, the fault consequence, and the fault description. Among them, the manually recorded fault text describes the occurrence of faults in the railway system in detail, contains a great deal of unexploited information, and is an important resource for fault analysis. However, text data is more difficult to process than other data, and manual analysis takes a lot of time. Text mining based on machine learning lets a computer process large volumes of fault text quickly and improves the efficiency and quality of fault information management in the railway system. Traditional text mining methods have certain limitations, so the techniques must be improved further in order to mine the semantic information in fault text efficiently and accurately. How to adapt traditional text mining methods to practical problems is becoming a hot research topic. Building on an in-depth review of related research, this thesis proposes a series of machine-learning-based methods for mining and analyzing railway system fault text data and verifies and evaluates them on real data, providing decision support for railway system fault management. The main research contents are as follows:

(1) This paper proposes a classification analysis method for railway system fault text data based on the MI-RFE feature selection method. The method classifies fault text according to the cause of the fault, and the classification results can be used for fault cause diagnosis. To avoid errors in the semantic understanding of railway terminology, a professional vocabulary was built so that Chinese word segmentation is as accurate as possible. The MI-RFE feature selection method is proposed as an extension of the traditional mutual information feature selection method. Experimental results show that MI-RFE improves both classification accuracy and the F1 (F-measure) value. To find a suitable text classification algorithm, Naive Bayes, SVM (Support Vector Machine), and KNN (K-Nearest Neighbor) are each used for text classification. The results show that the three methods yield similar F1 values, while SVM achieves higher classification accuracy.

(2) Because a classification model cannot be trained without supervision, this paper proposes a two-stage clustering algorithm, HCA (hierarchical cluster analysis) + k-means, for cluster analysis of railway system fault text data. The clusters provide important information for subsequent fault analysis and make it easier to develop solutions for faults of the same type. The word vectors of the fault text are high-dimensional and sparse, which makes them expensive to compute with and to store. To address this, PCA (Principal Component Analysis) is used to reduce the dimension of the word vectors. Because the number of clusters k is not known in advance, the within-cluster sum of squared errors is used as the indicator for selecting k. Because k-means is overly sensitive to its initial points, the HCA + k-means two-stage algorithm is used to determine a reasonable range of choices for the initial points. Experimental results show that HCA + k-means clusters better than the original k-means algorithm.

(3) Based on the LDA topic model, this paper carries out topic mining and analysis on railway system fault text data. Because the number of topics is not known in advance, a perplexity-based method for selecting the number of topics is proposed. Since the LDA topic model is based on word frequency statistics, word frequency vectors are used for feature extraction. The LDA topic model is solved with the variational inference EM (Expectation Maximization) algorithm, which yields the document-topic matrix and the topic-vocabulary matrix. Since the document-topic matrix does not directly reflect the relative strength of topics, topic strength is introduced as an indicator for selecting hot topics. Through experiments, the topics of the railway system fault text data were mined and the hot fault topics were identified, providing a decision-making basis for preventing frequently occurring failures.
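The two-stage idea in item (2) can be illustrated with a minimal sketch. It is not the thesis's implementation: the abstract does not state the linkage criterion or how the HCA result constrains the initial points, so this sketch assumes centroid linkage and uses the HCA cluster centroids directly as the k-means seeds. The function names (`hca_seeds`, `kmeans`, `sse`, `two_stage`) are hypothetical, and the inputs are assumed to be already reduced numeric vectors (for example, after PCA).

```python
from math import dist  # Euclidean distance, Python 3.8+


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def hca_seeds(points, k):
    """Stage 1 (assumed centroid linkage): merge the two clusters whose
    centroids are closest until k clusters remain; return their centroids."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        pairs = ((a, b) for a in range(len(clusters))
                 for b in range(a + 1, len(clusters)))
        i, j = min(pairs, key=lambda ab: dist(centroid(clusters[ab[0]]),
                                              centroid(clusters[ab[1]])))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return [centroid(c) for c in clusters]


def kmeans(points, centers, max_iter=100):
    """Stage 2: Lloyd's k-means started from the given centers."""
    labels = []
    for _ in range(max_iter):
        # Assign every point to its nearest center.
        labels = [min(range(len(centers)), key=lambda c: dist(p, centers[c]))
                  for p in points]
        new_centers = []
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Keep the old center if a cluster ends up empty.
            new_centers.append(centroid(members) if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


def sse(points, labels, centers):
    """Within-cluster sum of squared errors."""
    return sum(dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))


def two_stage(points, k):
    """HCA seeding followed by k-means refinement."""
    return kmeans(points, hca_seeds(points, k))
```

Running `two_stage(points, k)` for several candidate values of k and comparing `sse(points, labels, centers)` mirrors the SSE-based selection of k described in the abstract. The pairwise search in `hca_seeds` is deliberately naive and only suitable for small samples.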
Keywords/Search Tags: text mining, text classification, text clustering, topic mining, railway system fault text data