| Patent information reflects the latest level of scientific and technological innovation. With the rapid development of computer technology and of scientific and technological innovation, the amount of data in all walks of life continues to increase, and patent data has likewise grown explosively, giving rise to the phenomenon of "data explosion, knowledge poverty". Traditional patent mining methods are slow and struggle to uncover the large amount of technical information and valuable knowledge hidden in patent texts, so they fall far short of current application needs. How to mine valuable information efficiently from large-scale patent data resources has therefore become a hot topic. In recent years, text data mining technology has developed vigorously and given a strong boost to patent information mining. It can effectively help enterprises carry out innovative research and development, grasp the development trends of new technologies, and gain a competitive edge with patent innovation as the spearhead. Text data mining is an effective way to address the problem of abundant data but scarce knowledge. Based on patent data, this paper uses deep clustering and outlier detection techniques from text data mining to mine the value of patents and obtain valuable knowledge from patent texts. The main work is as follows: (1) To address the inability of traditional vector representation methods to handle polysemy, and the tendency of deep clustering to separate feature embedding from the clustering process, a patent clustering method that integrates BERT with an improved deep autoencoder is proposed. First, BERT is used to produce the initial vector representation of the patent text, which addresses the problem of polysemous words in patent text. Second, the Gaussian Mixture Model (GMM) is associated with the autoencoder to construct a clustering module (CM) in the form of a single-hidden-layer autoencoder, and the CM is embedded into a deep autoencoder (DAE) to form the DAE-CM model, which addresses the separation of embedding and clustering. Experiments verify the equivalence of the CM and the GMM and show that the DAE-CM model achieves higher accuracy than existing models on the dataset. Finally, the performance of the patent clustering model is further evaluated on a patent dataset. (2) To address the problem that existing patent novelty measurement methods rely on domain-specific knowledge and expert intervention, a fully automated method for identifying patent novelty that does not rely on domain-specific knowledge is proposed. First, RoBERTa is used to represent patents as vectors, which addresses the inability of traditional static word embedding models to represent polysemy in patents. Second, the density distribution of the data points is combined with information entropy to improve the local outlier factor (LOF) algorithm, so that the number of outliers and the set of outlying data points can be determined and the accuracy of outlier detection is improved. The improved LOF then computes a novelty score on a numerical scale for each patent vector. An experimental study was conducted on 2560 medical imaging technology patents, and two verifications were carried out on the new patents with high measured novelty scores to demonstrate the effectiveness of the proposed method. The experiments show that the novelty scores measured by the proposed method are significantly correlated with relevant patent indicators from the existing literature, and that the patents identified as novel have a higher technical impact. |
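
To make the novelty-scoring idea in part (2) more concrete, the following is a minimal sketch of the standard local outlier factor (LOF) applied to document vectors. It is not the entropy-improved LOF proposed in this work, whose details the abstract does not give. The function name `lof_scores`, the neighbourhood size `k=5`, and the synthetic 8-dimensional vectors (standing in for RoBERTa patent embeddings) are all assumptions made for illustration.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def lof_scores(points, k=5):
    """Standard LOF per point; values well above 1 indicate outliers.

    Simplification (assumption): exactly k neighbours are used, so ties
    at the k-th distance are not expanded as in the original definition.
    """
    n = len(points)
    dist = [[euclidean(p, q) for q in points] for p in points]

    # k nearest neighbours and k-distance of every point.
    neighbours, k_distance = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        knn = order[:k]
        neighbours.append(knn)
        k_distance.append(dist[i][knn[-1]])

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_distance[j], dist[i][j]) for j in neighbours[i]]
        mean_reach = sum(reach) / len(reach)
        lrd.append(1.0 / max(mean_reach, 1e-12))  # guard against duplicates

    # LOF: mean density of the neighbours relative to the point's own density.
    return [
        sum(lrd[j] for j in neighbours[i]) / (len(neighbours[i]) * lrd[i])
        for i in range(n)
    ]


# Synthetic stand-ins for patent embeddings (hypothetical data).
random.seed(0)
patents = [[random.gauss(0, 1) for _ in range(8)] for _ in range(40)]
patents.append([6.0] * 8)  # one vector placed far from the main cluster

scores = lof_scores(patents, k=5)
most_novel = max(range(len(scores)), key=scores.__getitem__)
print("highest novelty score at index", most_novel)
```

In this sketch the LOF value itself plays the role of the novelty score: a patent whose embedding lies in a much sparser region than its neighbours receives a score well above 1. In practice the vectors would come from a pretrained language model rather than a random generator, and the choice of `k` would need tuning.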