Once a defect occurs in capacitive equipment, especially one classified as "emergent" or "major", it may seriously disrupt the normal operation of the power grid and even cause huge losses. Mining the defect text data of capacitive equipment to extract accurate information about defects when they occur, and predicting the occurrence time and type of defects, is therefore of great significance and value. The defect data of capacitive equipment consists mainly of text from the daily operation and maintenance records of power grid enterprises, and many fields are described in natural language. These descriptions are not standardized: clerks enter defect descriptions with considerable arbitrariness, and the length and content of defect texts entered by different personnel may differ. The defect text of capacitive equipment is thus highly complex, large in volume, and difficult to process, which makes mining it a great challenge. This thesis focuses on the mining of defect text. The research tasks and results are as follows.

(1) The term frequency-inverse document frequency (TF-IDF) algorithm is used to encode the defect text. After encoding, each defect text sample is a vector of dimension 10675. However, even the defect text with the largest number of words contains only 136 words, so the non-zero elements of its vector account for only about 1.3% of the total dimension. The TF-IDF encoded defect text features are therefore very sparse, and this thesis uses a non-negative matrix factorization algorithm to reduce the dimensionality of the TF-IDF encoded defect text.

(2) To address the shortcomings of TF-IDF, a feature expansion algorithm is also used to encode the defect text. Only the words with high descriptive ability are selected from the 10675 words, and the selected words form the feature space of the defect text, which reduces the vector dimension. The samples are then expanded using the mutual information between these words.

(3) Based on the TF-IDF, TF-IDF with non-negative matrix factorization, and feature expansion approaches, k-means clustering and hierarchical clustering are each applied to the preprocessed defect data sets. The experimental results show that k-means clustering combined with TF-IDF and non-negative matrix factorization achieves the best performance, with a silhouette coefficient of 0.92 and an optimal number of categories of 163.

(4) Based on the optimal clustering model, a naive Bayes classifier, a random forest, and Bidirectional Encoder Representations from Transformers (BERT) are used to classify both the original defect text and the defect text after feature processing. The experimental results show that the performance of all three methods improves after feature processing. Among the three classifiers, BERT achieves the best performance after feature expansion, with classification accuracy increasing from 0.98 to 0.99. The accuracies of naive Bayes and random forest improve from 0.74 and 0.86 to 0.78 and 0.88, respectively.

(5) After the defect text is classified, knowledge triples are extracted using a dependency analysis approach, and Neo4j is selected as the database to store and search the defect text of capacitive equipment.
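The sparsity argument in item (1) can be illustrated with a minimal sketch. It assumes whitespace tokenization, a smoothed IDF variant, and three hypothetical toy records; none of these are taken from the thesis, and the function names `tfidf_encode` and `nonzero_fraction` are illustrative only. The last lines reproduce the arithmetic behind the "about 1.3%" figure from the numbers stated above (136 words, 10675 dimensions).

```python
import math
from collections import Counter


def tfidf_encode(docs):
    """Encode tokenized documents as dense TF-IDF vectors over a shared vocabulary.

    Assumption: tf is the relative term frequency and idf is a smoothed
    variant, ln((1 + N) / (1 + df)) + 1. The thesis may use a different form.
    """
    vocab = sorted({tok for doc in docs for tok in doc})
    index = {tok: i for i, tok in enumerate(vocab)}
    n_docs = len(docs)
    # Document frequency: number of documents that contain each token.
    df = Counter(tok for doc in docs for tok in set(doc))
    vectors = []
    for doc in docs:
        vec = [0.0] * len(vocab)
        for tok, count in Counter(doc).items():
            tf = count / len(doc)
            idf = math.log((1 + n_docs) / (1 + df[tok])) + 1.0
            vec[index[tok]] = tf * idf
        vectors.append(vec)
    return vocab, vectors


def nonzero_fraction(vec):
    """Fraction of vector elements that carry a non-zero weight."""
    return sum(1 for v in vec if v != 0.0) / len(vec)


# Hypothetical toy defect records (not from the thesis data set).
docs = [
    "capacitor oil leakage at bushing".split(),
    "capacitor bushing overheating detected".split(),
    "abnormal sound from coupling capacitor".split(),
]
vocab, vectors = tfidf_encode(docs)
fractions = [nonzero_fraction(v) for v in vectors]

# Upper bound on the non-zero fraction for the longest record in the abstract:
# at most 136 distinct words out of a 10675-word vocabulary.
upper_bound = 136 / 10675
print(f"vocabulary size: {len(vocab)}")
print(f"non-zero fraction per toy record: {fractions}")
print(f"longest record in the abstract: at most {upper_bound:.1%} non-zero")
```

On a toy corpus the vectors are far denser than in the thesis, because the vocabulary is tiny. The point of the sketch is only that the non-zero fraction of a TF-IDF vector is bounded by the number of distinct words in the record divided by the vocabulary size, which is what motivates the dimensionality reduction step.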