Research On Sorted-neighborhood Method And Its Application In Chinese Data Cleaning

Posted on: 2019-07-06
Degree: Master
Type: Thesis
Country: China
Candidate: P G Zhang
GTID: 2428330566974196
Subject: Software engineering
Abstract/Summary:
With the development of Internet technology, data analysis plays an important role in all walks of life. In the process of data analysis, how to obtain a complete and stable data source has gradually become a focus of attention. Dirty data, such as redundant, missing, uncertain and inconsistent data, directly affects the accuracy of subsequent data mining and the correctness of decision-making, so the importance of data cleaning is self-evident. At the same time, because Chinese is the second largest language in the world, Chinese data cleaning has attracted wide attention from scholars. This paper focuses on Chinese data cleaning based on the Sorted-neighborhood Method.

There are many kinds of data cleaning, and repeated value (duplicate record) cleaning is an important and challenging one. In early research on Chinese repeated value cleaning, methods that performed well on English data were applied to Chinese data directly. However, because the two languages differ in semantics and usage, research results show two main problems: the traditional cleaning algorithm cannot adapt to the Chinese semantic environment, and it cannot effectively handle the homophones and synonyms that are common in Chinese. As a result, the final cleaning results deviate from the raw data.

To address these shortcomings, this paper proposes a Chinese data cleaning algorithm based on the Sorted-neighborhood Method. The traditional Sorted-neighborhood Method is first applied to Chinese data cleaning directly, and the result shows that its accuracy is much lower than on English data. Further study finds that Chinese semantics is based on words, while the traditional Sorted-neighborhood Method cannot calculate similarity in terms of words. In addition, the algorithm cannot effectively determine whether two words are synonyms when computing similarity. In view of these shortcomings, this paper gives an improvement: when edit distance is used to calculate similarity, Chinese word segmentation is applied so that the unit of similarity computation changes from a single Chinese character to a word, which adapts the computation to the Chinese semantic environment. On this basis, a synonym lexicon is introduced into the similarity calculation and is used as the standard for deciding whether two words are synonyms.

The experimental results show that the improved Sorted-neighborhood Method calculates similarity in terms of words, reflecting the fact that Chinese semantics is based on words. This not only reduces the number of comparisons during the calculation and saves running time, but also paves the way for the comparison of synonyms. The accuracy of the improved algorithm is higher than that of the traditional Sorted-neighborhood Method, and the expected effect is achieved.
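
The idea described in the abstract (word-level edit distance with a synonym lexicon, applied inside a sorted sliding window) can be illustrated with a minimal Python sketch. This is not the thesis's implementation; the abstract gives no code. All function names, the sample records, the synonym groups, the sort key, the window size and the threshold are hypothetical. Records are assumed to have already been segmented into words by some Chinese word segmenter, and the synonym lexicon is assumed to be a list of sets of interchangeable words.

```python
from typing import Callable, Dict, List, Sequence, Set, Tuple

Tokens = Sequence[str]
Matcher = Callable[[str, str], bool]


def make_matcher(synonym_groups: List[Set[str]]) -> Matcher:
    """Build a word-equality test from a (hypothetical) synonym lexicon."""
    group_of: Dict[str, int] = {}
    for idx, group in enumerate(synonym_groups):
        for word in group:
            group_of[word] = idx

    def same(x: str, y: str) -> bool:
        if x == y:
            return True
        gx = group_of.get(x)
        # Two different words match only if both are in the same synonym group.
        return gx is not None and gx == group_of.get(y)

    return same


def word_edit_distance(a: Tokens, b: Tokens, same: Matcher) -> int:
    """Levenshtein distance where the unit is a word, not a character."""
    prev = list(range(len(b) + 1))
    for i, ta in enumerate(a, start=1):
        curr = [i]
        for j, tb in enumerate(b, start=1):
            cost = 0 if same(ta, tb) else 1
            curr.append(min(prev[j] + 1,          # delete ta
                            curr[j - 1] + 1,      # insert tb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(a: Tokens, b: Tokens, same: Matcher) -> float:
    """Normalize the distance to [0, 1]; 1.0 means identical up to synonyms."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - word_edit_distance(a, b, same) / longest


def sorted_neighborhood(records: List[Tokens],
                        key: Callable[[Tokens], str],
                        window: int,
                        threshold: float,
                        same: Matcher) -> List[Tuple[int, int]]:
    """Sort by key, then compare each record with the next window-1 records."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs: List[Tuple[int, int]] = []
    for pos, i in enumerate(order):
        for j in order[pos + 1: pos + window]:
            if similarity(records[i], records[j], same) >= threshold:
                pairs.append((min(i, j), max(i, j)))
    return pairs


# Hypothetical, already-segmented records; 计算机 and 电脑 both mean "computer".
records = [
    ["北京", "大学", "计算机", "学院"],
    ["上海", "交通", "大学"],
    ["北京", "大学", "电脑", "学院"],
]
synonyms = [{"计算机", "电脑"}]

duplicates = sorted_neighborhood(records, key="".join, window=2,
                                 threshold=0.8, same=make_matcher(synonyms))
print(duplicates)  # [(0, 2)]
```

With an empty lexicon the same two records differ in one of four words, giving a similarity of 0.75, which is below the illustrative 0.8 threshold, so no pair is reported. Working on words also shortens the sequences passed to the edit-distance routine, which is consistent with the reduction in comparisons that the abstract reports.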
Keywords/Search Tags:Data cleaning, Repeated value cleaning, Sorted-neighborhood Method, Edit distance