| As the carrier of information, data is a research topic in many fields, such as intelligent computing and information fusion, where the goal is to discover the laws behind massive data. However, dirty data greatly increases the complexity of knowledge discovery, so data cleaning is an important task. Researchers at home and abroad have systematically studied the cleaning of English data, but Chinese and English differ greatly, and cleaning methods designed for English data are not fully suitable for Chinese. In addition, with the surge of Web data, the computing and storage capacity of a single server is becoming inadequate. This paper therefore studies a distributed cleaning method for similar duplicate Chinese text. Building on the Hadoop distributed platform and existing cleaning methods, and targeting the low accuracy and slow speed of traditional single-machine cleaning, it proposes a parallel cleaning method that combines the BERT model with the k-means clustering algorithm. The main work is as follows: (1) To address synonymy and polysemy in Chinese, the paper analyzes how traditional vectorization methods lose semantic information. During text-to-vector conversion, position vectors are introduced to capture the contextual features of words, and each vector is adjusted dynamically according to semantics, so that a polysemous word receives different vector representations in different contexts. A parallel implementation of this process is designed on Hadoop to prepare for the subsequent cleaning of similar duplicate data. (2) Clustering is used to clean similar duplicate data in parallel. The cosine similarity measure, the Canopy algorithm and the k-means clustering algorithm provided by the Mahout library are used to detect semantically similar duplicate data and cluster it. The k-means source code in Mahout is analyzed and extended with a Combine step, which reduces the communication overhead between Map and Reduce and improves cleaning efficiency. To reduce the impact of an arbitrarily chosen k value on clustering, the data is first processed with Canopy coarse clustering to obtain approximate cluster centers, and similar texts are then clustered with k-means to complete the cleaning. (3) The data cleaning process is consolidated and comparative experiments are conducted on several data sets. The results show that, compared with other methods, the vector representation of text produced by this method carries more of the actual semantic information, which gives the subsequent clustering of similar duplicate text higher accuracy. The distributed experiments also show that the parallel design proposed in this paper achieves good speedup and scalability. |
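
The two-stage clustering described in (2) can be illustrated with a minimal single-process sketch. It is not the paper's Hadoop/Mahout implementation: the function names, the thresholds `t1` and `t2`, and the toy two-dimensional vectors are assumptions made for illustration, and in practice the vectors would be BERT sentence embeddings. Canopy coarse clustering with cosine distance supplies both the number of clusters and the initial centers, k-means then refines the assignment, and one text per cluster is kept.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def canopy_centers(vectors, t1, t2):
    """Coarse Canopy pass using cosine distance (1 - similarity).

    t1 is the loose threshold (canopy membership) and t2 the tight one
    (removal from the candidate list); both values are illustrative.
    """
    assert t1 > t2 >= 0.0
    remaining = list(range(len(vectors)))
    centers = []
    while remaining:
        seed = remaining[0]
        distance = {
            i: 1.0 - cosine_similarity(vectors[seed], vectors[i])
            for i in remaining
        }
        members = [vectors[i] for i in remaining
                   if i == seed or distance[i] <= t1]
        centers.append(mean_vector(members))
        # Points tightly bound to this canopy cannot seed another one.
        remaining = [i for i in remaining
                     if i != seed and distance[i] > t2]
    return centers


def kmeans_cosine(vectors, initial_centers, max_iter=20):
    """k-means that assigns each vector to its most similar center."""
    centers = [list(c) for c in initial_centers]
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            max(range(len(centers)),
                key=lambda k: cosine_similarity(v, centers[k]))
            for v in vectors
        ]
        if new_assignment == assignment:
            break  # converged: no vector changed cluster
        assignment = new_assignment
        for k in range(len(centers)):
            members = [v for v, a in zip(vectors, assignment) if a == k]
            if members:
                centers[k] = mean_vector(members)
    return assignment


def deduplicate(texts, vectors, t1=0.5, t2=0.2):
    """Keep the first text of every cluster of similar duplicates."""
    centers = canopy_centers(vectors, t1, t2)
    assignment = kmeans_cosine(vectors, centers)
    kept = {}
    for text, cluster in zip(texts, assignment):
        kept.setdefault(cluster, text)
    return list(kept.values())


if __name__ == "__main__":
    texts = ["doc A", "doc A (near copy)", "doc B", "doc B (near copy)"]
    vectors = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 1.0]]
    print(deduplicate(texts, vectors))
```

In a MapReduce setting, the assignment step would correspond to the Map phase and the center update to the Reduce phase. A Combine step of the kind the abstract mentions would pre-aggregate partial sums and counts per cluster on each mapper before they are sent over the network.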