| China’s railways have established professional safety monitoring systems for the train operations, locomotive, track works, electrical and rolling stock departments. These systems contain large amounts of unstructured data such as text, voice and images, of which text accounts for the vast majority. The railway works section is responsible for the repair and maintenance of railway lines and related equipment, so text mining on works section data can support railway safety maintenance. The violation record data of the railway works section are the text records that railway inspectors write and report when staff commit operating violations during the inspection stage of railway works. The inspector is also responsible for labeling the violation data. Because some inspectors are not familiar with the business, the content of a violation record is sometimes inconsistent with the label actually assigned to it. Such data are called abnormal labeling data. Abnormal labeling data complicate data management in the railway system and may leave hidden dangers for railway safety. An analysis of the data from the works section of one railway bureau shows that the texts are mostly short and that the categories are severely imbalanced. Building on current machine learning algorithms, this paper combines feature selection, sampling and ensemble learning to detect abnormally labeled text data. The main innovations and contributions of this paper are as follows: (1) This paper improves traditional feature selection by adding synonym fusion, proposes a new feature selection algorithm based on a TF-IDF variant that incorporates category information, and demonstrates the effectiveness of the algorithm through experiments. (2) This paper proposes a bidirectional text oversampling method. Because the data are imbalanced, the classification model is prone to underfitting
on the categories with few samples during training. Bidirectional text oversampling effectively alleviates this underfitting on the minority classes. The experiments show that the proposed oversampling method improves the overall classification performance. (3) This paper proposes a method that combines ATPE with the XGBoost algorithm. XGBoost performs well, but the railway business is complex and there are many types of recorded data. The XGBoost model has many hyperparameters, so manual tuning and random search are limited. XGBoost combined with ATPE can find a better parameter combination within a limited number of iterations and further improves the classification ability of the XGBoost model on railway works text data. Finally, based on the above model, this paper implements a railway abnormal text labeling system, which can effectively find abnormal labeling data in railway violation records and provides a labeling function to support railway safety management. |
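The abstract does not give the formula behind the category-aware TF-IDF feature selection in contribution (1), so the following is only a minimal sketch of the general idea, not the paper's algorithm. It assumes that each term's TF-IDF score is multiplied by a hypothetical "concentration" factor, the share of the term's documents that fall in its dominant category, so that terms confined to one violation category rank higher. The function names, the smoothing of the IDF and the toy tokens are all illustrative assumptions.

```python
import math
from collections import Counter, defaultdict


def category_aware_tfidf(docs, labels):
    """Score terms by TF-IDF times a class-concentration factor.

    docs: list of token lists; labels: one category per document.
    The concentration factor is an illustrative assumption, not the
    formula used in the paper.
    """
    n_docs = len(docs)
    term_freq = Counter()                  # total occurrences of each term
    doc_freq = Counter()                   # number of documents containing each term
    class_doc_freq = defaultdict(Counter)  # per term: documents per category
    for tokens, label in zip(docs, labels):
        term_freq.update(tokens)
        for term in set(tokens):
            doc_freq[term] += 1
            class_doc_freq[term][label] += 1

    total_terms = sum(term_freq.values())
    scores = {}
    for term, df in doc_freq.items():
        tf = term_freq[term] / total_terms
        idf = math.log((1 + n_docs) / (1 + df)) + 1  # smoothed IDF
        # 1.0 means the term occurs in a single category only.
        concentration = max(class_doc_freq[term].values()) / df
        scores[term] = tf * idf * concentration
    return scores


def select_features(docs, labels, k):
    """Return the k highest-scoring terms (ties broken alphabetically)."""
    scores = category_aware_tfidf(docs, labels)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Toy, made-up violation records (already tokenized).
docs = [
    ["helmet", "missing", "track"],
    ["helmet", "not", "worn"],
    ["tool", "left", "track"],
    ["tool", "missing", "track"],
]
labels = ["protective_gear", "protective_gear", "tool_handling", "tool_handling"]
print(select_features(docs, labels, 2))
```

In this toy data, "helmet" and "missing" have the same term and document frequencies, but "helmet" occurs in only one category, so it scores higher. A real pipeline would also need word segmentation and the synonym fusion step mentioned above, neither of which is sketched here.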